Abstract

Constrained lossy source coding and channel coding with side information problems which extend the classic
Wyner-Ziv and Gel'fand-Pinsker problems are considered. Inspired by applications in sensor networking and control, 
we first consider lossy source coding with two-sided partial side information where the quality/availability of the side 
information can be influenced by a cost-constrained action sequence. A decoder reconstructs a source sequence 
subject to the distortion constraint, and at the same time, an encoder is additionally required to be able to estimate 
the decoder's reconstruction. Next, we consider the channel coding "dual" where the channel state is assumed to 
depend on the action sequence, and the decoder is required to decode both the transmitted message and channel input 
reliably. 

Implications on the fundamental limits of communication in discrete memoryless systems due to the additional 
reconstruction constraints are investigated. Single-letter expressions for the rate-distortion-cost function and channel 
capacity for the respective source and channel coding problems are derived. The dual relation between the two problems
is discussed. Additionally, based on the two-stage coding structure and the additional reconstruction constraint of
the channel coding problem, we discuss and give an interpretation of the two-stage coding condition which appears in 
the channel capacity expression. Besides the rate constraint on the message, this condition is a necessary and sufficient 
condition for reliable transmission of the channel input sequence over the channel in our "two-stage" communication 
problem. It is also shown in one example that there exists a case where the two-stage coding condition can be active 
in computing the capacity, and it thus can actively restrict the set of capacity achieving input distributions. 

I. Introduction 

The problems of source coding with side information and channel coding with state information have received 
considerable attention due to their broad set of applications, e.g., in high-definition television where the noisy analog 
version of the TV signal is the side information at the receiver, in cognitive radio where the secondary user has 
knowledge of the message to be transmitted by the primary user, or in digital watermarking where the host signal 
plays a role of state information available at the transmitter [1], [2]. In [3] Wyner and Ziv considered rate-distortion 
coding for a source with side information available at the receiver, while the problem of coding for channels with
noncausal state information available at the transmitter was solved by Gel'fand and Pinsker in [4]. In practice, the
transmitter and/or the receiver may not have full knowledge of the channel state information. Heegard and El Gamal
in [5] studied the channel with rate-limited noncausal state information available at the encoder and/or the decoder.
Further, Cover and Chiang in [1] provided a unifying framework to characterize channel capacity and rate-distortion
functions for systems with two-sided partial state information, and they also discussed aspects of duality between the
source and channel coding problems. 

In this work we consider source and channel coding with two-sided partial side/state information where the 
side/state information can be influenced by other nodes in the system. Such side/state information is termed as 
action-dependent side/state information [6], [7]. Weissman first studied the problem of coding for a channel with
action-dependent state [6], and the source coding dual was investigated by Permuter and Weissman [7] where a
node in the system can take action to influence the quality/availability of the side information. This novel action-dependent
coding framework introduces new interesting features to the general system model, involving cost-constrained
communication and interaction among nodes, and is therefore highly relevant to many applications
including sensor networking and control, and multistage coding for memories [6], [7]. Additional work on coding 
with action includes [8] where it is natural to consider action probing as a means for channel state acquisition, 
and in [9], [10] where the problem of source coding with action-dependent side information is extended to the 
multi-terminal case. 

In addition, we are interested in the recently introduced problem of lossy source coding with side information 
under the additional requirement that the sender should be able to locally produce an exact copy of the receiver's 
reconstruction. This requirement was introduced and termed the common reconstruction (CR) constraint by Steinberg 
[11]. The general case of additional reconstruction subject to the distortion constraint was later studied in [12]. The 
channel coding dual is also investigated in the context of information embedding by Sumszyk and Steinberg in [13] 
where the decoder is interested in decoding both an embedded message and a stegotext signal. There, it is shown that 
if the objective is to decode only the message and the stegotext (channel input signal), then decoding the message 
and the channel state first and then re-encoding the channel input is suboptimal. As with action-dependent coding, 
also the framework of additional reconstruction requirements provides new useful features of simultaneous signal 
transmission in the general system model. Recent works on common reconstruction in multi-terminal information 
theoretic problems include [14], [15]. Some closely related works on additional signal reconstruction include [16], 
[17]. 

In the present work we unify the problems of action-dependence and common reconstruction constraints by 
studying source and channel coding with action-dependent partial side information known noncausally at the encoder
and the decoder, and with additional reconstruction constraints. The constrained source coding problem is an
extension of Wyner-Ziv lossy source coding where the encoder is additionally required to estimate the decoder's
reconstruction reliably and the available two-sided partial side information depends on a cost-constrained action
sequence. This setting captures the problem of simultaneously controlling the quality of the decoder's reconstruction
via the action-dependent side information, and monitoring the resulting performance via common reconstruction.
As a motivating example, consider a closed-loop control system. Assuming that there exists a coding scheme which 
satisfies the CR constraint, an observer/encoder having knowledge about the reconstruction at a controller/decoder 
will have the possibility to compensate for the possible impact of state reconstruction distortion and thus achieve better
control performance in future time instants. The unified system modeled with both action-dependent side information
and the CR constraint can also be viewed as a resource-efficient system, i.e., the quality of side information can
be adjusted on demand and the control objective can be achieved more efficiently due to the knowledge of the 
controller's reconstruction at the observer. On the other hand, the constrained channel coding dual is an extension 
of the Gel'fand-Pinsker problem where the channel state is allowed to depend on an action sequence and the 
decoder is additionally required to reconstruct the channel input signal reliably. This setting captures the idea of 
simultaneously transmitting the message and the channel input sequence reliably over the channel. To be consistent
with the terminology used in [13], we refer to the reconstruction constraint as the reversible input (RI) constraint. 
This setup is for example relevant in a data storage problem where a user is interested in both decoding the 
embedded message and in tracing what has been written in the previous stages. It may also be relevant in a wireless 
networking scenario where knowing the channel input signal can enable interference mitigation at some node in 
the network. 

In this work, we characterize fundamental limits of discrete memoryless systems, and discuss the implication
of additional reconstruction constraints. An investigation of the dual relationship between the problems is also of
interest. We note that different kinds of duality between various source and channel coding problems with side
information (SI) have been recognized earlier. For example, several works have discussed duality between the
Wyner-Ziv and Gel'fand-Pinsker problems [1], [18], [19], [20]. Our definition of duality simply follows the notion
of "formula" duality in [1]. Although it is not based on a strict definition like in other work, it is appealing that
one might be able to anticipate the optimal solution of a new problem from its dual problem.

Our source and channel coding problems are "dually" formulated, i.e., an encoder in one problem has the same 
functionality as a decoder in the other problem. However, there are some fundamental differences in their operational
structure. As we will show, the source coding setup requires causal processing at the encoder for compressing the 
source using action-dependent side information, while in the case of channel coding the channel decoder can 
observe the channel output and the channel state information noncausally. In addition, the channel coding scenario 
requires sequential two-stage processing at the encoder in generating an action-dependent state sequence and then 
a channel input sequence. When we impose an additional constraint on decoding a signal generated in the later 
stage (channel input $X^n$) at the decoder, an extra condition, apart from the rate constraint, is needed. This leads
us to the conclusion that formula duality between our problems does not hold. We term the new condition which
appears in the channel coding problem the two-stage coding condition¹ since it arises essentially from the two-stage
operational structure of the setting that requires the channel input reconstruction. In addition to the rate constraint,
we show that the two-stage coding condition is a necessary and sufficient condition for reliable transmission of the
channel input signal over the channel in our two-stage communication problem. We also discuss different aspects
of the presence of the two-stage coding condition in the channel capacity problem, based on operational, source
coding, and channel coding perspectives. Finally, we show in one of the examples that there exists a case where
the two-stage coding condition can be active when computing the capacity, and it can thus actively restrict the set
of capacity achieving input distributions. The material in this paper was presented in part in [23], [24], [25], and
[26].

¹After submission, we became aware of two recent works [21], [22] in which a similar two-stage coding condition appears in a similar fashion
as an extra constraint resulting from the additional reconstruction requirements in the two-stage communication setting.

The remaining parts of the paper are organized as follows. In Section II we formulate the problem of source 
coding with action-dependent two-sided partial SI and CR constraint. We derive a closed-form expression for the 
rate-distortion-cost function. Other related results and a binary example illustrating an implication of the common
reconstruction constraint on the rate-distortion-cost tradeoff are given. The channel coding dual is presented in 
Section III, where the channel capacity is found in a form with the two-stage coding condition. In this section we 
also present other related results as well as an example showing that the two-stage coding condition can be active 
in some cases. We discuss the presence of the two-stage coding condition as well as the dual relations among the 
related problems in Section IV. The conclusion is provided in Section V. 

Notation: We denote the discrete random variables, their corresponding realizations or deterministic values, and
their alphabets by the upper case, lower case, and calligraphic letters, respectively. The term $X_m^n$ denotes the
sequence $\{X_m, \ldots, X_n\}$ when $m \le n$, and the empty set otherwise. Also, we use the shorthand notation $X^n$ for
$X_1^n$. The term $X^{n \setminus i}$ denotes the set $\{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$. Cardinality of the set $\mathcal{X}$ is denoted by $|\mathcal{X}|$.
Finally, we use $X - Y - Z$ to denote a Markov chain formed by the joint distribution of $(X, Y, Z)$ that is factorized
as $P_{X,Y,Z}(x,y,z) = P_{X,Y}(x,y)P_{Z|Y}(z|y)$ or $P_{X,Y,Z}(x,y,z) = P_{X|Y}(x|y)P_{Y,Z}(y,z)$.

II. Source Coding with Action-dependent Side Information and CR Constraint 

In this section we study source coding with action-dependent side information and CR constraint as depicted in 
Fig. 1. The side information is generated based on the source and cost-constrained action sequences, and is given
at both encoder and decoder. The decoder reconstructs the source sequence subject to the distortion constraint.
Meanwhile, the encoder is required to locally produce an exact copy of the decoder's reconstruction. This scenario
captures the idea of simultaneously controlling the quality of the decoder's reconstruction via action-dependent side
information, and monitoring the decoder's reconstruction via common reconstruction. Our setup can be considered 
as a combination of Permuter and Weissman's source coding with side information "vending machine" [7] and 
Steinberg's coding and common reconstruction [11]. 

In the following, we present the problem formulation, characterize the main result which is the rate-distortion-cost 
function of the setting, and also present some other related results. Finally, a binary example is given to illustrate 
an implication of the common reconstruction on the rate-distortion-cost tradeoff. 
Fig. 1. Rate distortion with action-dependent partial side information and CR constraint.



A. Problem Formulation and Main Results 

We consider finite alphabets for the source, action, side information, and reconstruction sets, i.e., $\mathcal{X}$, $\mathcal{A}$, $\mathcal{S}_e$, $\mathcal{S}_d$,
and $\hat{\mathcal{X}}$ are finite. Let $X^n$ be a source sequence of length $n$ with i.i.d. elements according to $P_X$. Given a source
sequence $X^n$, an encoder generates an index representing the source sequence and sends it over a noise-free, rate-limited
link to an action decoder and a source decoder. An action sequence is then selected based on the index.
With input $(X^n, A^n)$ whose current symbols do not depend on the previous channel output, the side information
$(S_e^n, S_d^n)$ is generated as an output of the memoryless channel with transition probability

$$P_{S_e^n, S_d^n | X^n, A^n}(s_e^n, s_d^n | x^n, a^n) = \prod_{i=1}^{n} P_{S_e, S_d | X, A}(s_{e,i}, s_{d,i} | x_i, a_i).$$

The side information is then mapped to the partial side information for the encoder and the decoder by the
mappings $l_e^{(n)}(S_e^n, S_d^n) = S_e^n$ and $l_d^{(n)}(S_e^n, S_d^n) = S_d^n$. Next, the encoder uses knowledge about $S_e^n$ to generate
another index and sends it to the source decoder. Given the indices and the side information $S_d^n$, the source decoder
reconstructs the source sequence as $\hat{X}^n$. On the other hand, the encoder also produces an estimate of the decoder's
reconstruction $\hat{X}^n$.

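Definition 1: An $(|\mathcal{W}^{(n)}|, n)$-code for a memoryless source with partially known two-sided action-dependent side
information and a CR constraint consists of the following functions:
an encoder 1

$$f_1^{(n)} : \mathcal{X}^n \to \mathcal{W}_1^{(n)} = \{1, 2, \ldots, |\mathcal{W}_1^{(n)}|\},$$

an action decoder

$$g_a^{(n)} : \mathcal{W}_1^{(n)} \to \mathcal{A}^n,$$

an encoder 2

$$f_2^{(n)} : \mathcal{X}^n \times \mathcal{S}_e^n \to \mathcal{W}_2^{(n)} = \{1, 2, \ldots, |\mathcal{W}_2^{(n)}|\},$$

a source decoder

$$g^{(n)} : \mathcal{W}_1^{(n)} \times \mathcal{W}_2^{(n)} \times \mathcal{S}_d^n \to \hat{\mathcal{X}}^n,$$

and a CR mapper

$$\psi^{(n)} : \mathcal{X}^n \times \mathcal{S}_e^n \to \hat{\mathcal{X}}^n,$$

where $|\mathcal{W}^{(n)}| = |\mathcal{W}_1^{(n)}| \, |\mathcal{W}_2^{(n)}|$.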

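Let $d : \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ and $\Lambda : \mathcal{A} \to [0, \infty)$ be the bounded single-letter distortion and cost measures. The
average distortion between a length-$n$ source sequence and its reconstruction at the decoder, and the average cost
are defined as

$$E\left[d^{(n)}(X^n, \hat{X}^n)\right] = E\left[\frac{1}{n} \sum_{i=1}^{n} d(X_i, \hat{X}_i)\right],$$

$$E\left[\Lambda^{(n)}(A^n)\right] = E\left[\frac{1}{n} \sum_{i=1}^{n} \Lambda(A_i)\right],$$

where $d^{(n)}(\cdot)$ and $\Lambda^{(n)}(\cdot)$ are the distortion and cost functions, respectively.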
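As a small illustration of the per-letter averages inside these expectations, the following sketch evaluates the block distortion and block cost for one fixed realization of the sequences. The Hamming distortion and the unit action cost are assumptions chosen to match the binary example later in this section, and the function names and sample sequences are hypothetical.

```python
def hamming(x, x_hat):
    """Single-letter Hamming distortion d(x, x_hat) (assumed measure)."""
    return 0.0 if x == x_hat else 1.0


def unit_cost(a):
    """Single-letter action cost Lambda(a) = a for a in {0, 1} (assumed measure)."""
    return float(a)


def block_distortion(x, x_hat, d=hamming):
    """(1/n) * sum_i d(x_i, x_hat_i) for one realization of the sequences."""
    if len(x) != len(x_hat):
        raise ValueError("sequences must have the same length")
    return sum(d(xi, xhi) for xi, xhi in zip(x, x_hat)) / len(x)


def block_cost(a, cost=unit_cost):
    """(1/n) * sum_i Lambda(a_i) for one realization of the action sequence."""
    return sum(cost(ai) for ai in a) / len(a)


# Hypothetical length-8 realizations.
x = [0, 1, 1, 0, 1, 0, 0, 1]
x_hat = [0, 1, 0, 0, 1, 0, 1, 1]
a = [1, 0, 1, 1, 0, 0, 1, 0]

print(block_distortion(x, x_hat))  # two mismatches out of eight symbols
print(block_cost(a))               # four unit-cost actions out of eight
```

The quantities in the definitions above are the expectations of these per-realization values over the source, the side information channel, and the code.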

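The average probability of error in estimating the decoder's reconstruction sequence is defined by

$$P_{\text{CR}}^{(n)} = \Pr\left(\psi^{(n)}(X^n, S_e^n) \neq g^{(n)}(W_1, W_2, S_d^n)\right),$$

where $\psi^{(n)}$ denotes the CR mapper, $g^{(n)}$ the source decoder, and $W_1$, $W_2$ the indices produced by the two encoders.
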
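Definition 2: A rate-distortion-cost triple $(R, D, C)$ is said to be achievable if for any $\delta > 0$, there exists for all
sufficiently large $n$ an $(|\mathcal{W}^{(n)}|, n)$-code such that $\frac{1}{n} \log |\mathcal{W}^{(n)}| \le R + \delta$,
$E[d^{(n)}(X^n, \hat{X}^n)] \le D + \delta$, $E[\Lambda^{(n)}(A^n)] \le C + \delta$, and the average probability of error in estimating the
decoder's reconstruction is at most $\delta$.

The rate-distortion-cost function $R_{ac,cr}(D, C)$ is the infimum of the achievable rates at distortion level $D$ and cost
$C$.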

Theorem 1: The rate-distortion-cost function for the source with a CR constraint and action-dependent partial
side information available at the encoder and the decoder is given by

$$R_{ac,cr}(D, C) = \min\left[I(X; A) + I(\hat{X}; X, S_e | A) - I(\hat{X}; S_d | A)\right], \quad (1)$$

where the joint distribution of $(X, A, S_e, S_d, \hat{X})$ is of the form

$$P_X(x) P_{A|X}(a|x) P_{S_e, S_d | X, A}(s_e, s_d | x, a) P_{\hat{X} | X, S_e, A}(\hat{x} | x, s_e, a)$$

and the minimization is over all $P_{A|X}$ and $P_{\hat{X}|X,S_e,A}$ subject to

$$E[d(X, \hat{X})] \le D, \quad E[\Lambda(A)] \le C.$$

Proof: The proof follows similar arguments as in [7] with some modifications in which we extend the SI-channel
transition probability to the two-sided SI $P_{S_e,S_d|X,A}$, and consider the additional CR constraint at the
encoder as in [11]. In the following, we give a sketch of the achievability proof. An action codebook $\{a^n\}$ of size
$2^{n(I(X;A)+\delta_\epsilon)}$ is generated i.i.d. $\sim P_A$. For each $a^n$ another codebook $\{\hat{x}^n\}$ of size $2^{n(I(\hat{X};X,S_e|A)+\delta_\epsilon)}$ is generated
i.i.d. $\sim P_{\hat{X}|A}$. These codewords are then distributed at random into $2^{n(I(\hat{X};X,S_e|A)-I(\hat{X};S_d|A)+2\delta_\epsilon)}$ equal-sized bins
(see Fig. 2). Given the source sequence $x^n$, the encoder in the first step uses $n(I(X;A)+\delta_\epsilon)$ bits to transmit an index
representing the action codeword $a^n$ which is jointly typical with $x^n$ to the decoder. Then the action-dependent
SI is generated based on $x^n$ and $a^n$. Given $x^n$, $s_e^n$ and the previously chosen $a^n$, the encoder in the second step uses
another $n(I(\hat{X};X,S_e|A) - I(\hat{X};S_d|A) + 2\delta_\epsilon)$ bits to communicate the bin index of the jointly typical codeword
$\hat{x}^n$. In addition, the encoder produces this jointly typical $\hat{x}^n$ as an estimate of the decoder's reconstruction. Given
the identity of $a^n$, the bin index of $\hat{x}^n$, and the side information $s_d^n$, the decoder will find with high probability the
unique codeword in its bin that is jointly typical with $s_d^n$ and $a^n$. Finally, the decoder outputs this codeword $\hat{x}^n$ as its reconstruction.
For completeness, we provide the detailed achievability proof and converse proof in Appendix B. ■

Fig. 2. Binning for the achievability: for each codeword $a^n$, a codebook $\{\hat{x}^n\}$ of size $2^{n(I(\hat{X};X,S_e|A)+\delta_\epsilon)}$ is generated i.i.d. each $\sim P_{\hat{X}|A}$.
Then they are distributed uniformly into $2^{n(I(\hat{X};X,S_e|A)-I(\hat{X};S_d|A)+2\delta_\epsilon)}$ equal-sized bins, each containing $2^{n(I(\hat{X};S_d|A)-\delta_\epsilon)}$ codewords.
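The rate bookkeeping behind the binning step can be written out explicitly. The following is a sketch that assumes the codebook and bin sizes of the achievability argument for Theorem 1, with $\hat{X}$ denoting the reconstruction variable and $\delta_\epsilon$ the small slack term; it shows the number of codewords per bin and how the two transmitted indices add up to the expression in (1).

```latex
\begin{align*}
\frac{\text{codewords per action codeword}}{\text{number of bins}}
  &= \frac{2^{n(I(\hat{X};X,S_e|A)+\delta_\epsilon)}}
          {2^{n(I(\hat{X};X,S_e|A)-I(\hat{X};S_d|A)+2\delta_\epsilon)}}
   = 2^{n(I(\hat{X};S_d|A)-\delta_\epsilon)}, \\
\text{total rate}
  &= \underbrace{I(X;A)+\delta_\epsilon}_{\text{action index}}
   + \underbrace{I(\hat{X};X,S_e|A)-I(\hat{X};S_d|A)+2\delta_\epsilon}_{\text{bin index}} \\
  &= I(X;A)+I(\hat{X};X,S_e|A)-I(\hat{X};S_d|A)+3\delta_\epsilon .
\end{align*}
```

Each bin thus holds slightly fewer than $2^{nI(\hat{X};S_d|A)}$ codewords, which is what allows the decoder to single out the correct one using $s_d^n$ and $a^n$.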
Remark 1: We can also express $R_{ac,cr}(D, C)$ in (1) as

$$R_{ac,cr}(D, C) = \min\left[I(X; A) + H(\hat{X}|A) - H(\hat{X}|A, X, S_e) - H(\hat{X}|A) + H(\hat{X}|A, S_d)\right]$$

$$\overset{(*)}{=} \min\left[I(X; A) - H(\hat{X}|A, X, S_e, S_d) + H(\hat{X}|A, S_d)\right]$$

$$= \min\left[I(X; A) + I(\hat{X}; X, S_e | A, S_d)\right], \quad (2)$$

where $(*)$ follows from the Markov chain $\hat{X} - (X, A, S_e) - S_d$ and the minimization is over the same distribution
as in (1).
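The equivalence of the objectives in (1) and (2) can also be checked numerically. The sketch below draws a random joint distribution with the factorization of Theorem 1 and evaluates both expressions; the alphabet sizes, the random seed, and all helper names are hypothetical choices, and the check covers only one fixed distribution rather than the minimization.

```python
import itertools
import math
import random


def entropy_of(joint, keep):
    """Entropy in bits of the marginal on the coordinate positions in `keep`."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def cond_mi(joint, u, v, w=()):
    """I(U;V|W) = H(U,W) + H(V,W) - H(U,V,W) - H(W)."""
    u, v, w = tuple(u), tuple(v), tuple(w)
    return (entropy_of(joint, u + w) + entropy_of(joint, v + w)
            - entropy_of(joint, u + v + w) - entropy_of(joint, w))


rng = random.Random(0)
NX, NA, NE, ND, NH = 2, 2, 2, 3, 2  # hypothetical alphabet sizes


def random_pmf(size):
    """Random strictly positive probability vector of the given length."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


p_x = random_pmf(NX)
p_a_given_x = {x: random_pmf(NA) for x in range(NX)}
p_s_given_xa = {(x, a): random_pmf(NE * ND)
                for x in range(NX) for a in range(NA)}
p_h_given_xea = {(x, e, a): random_pmf(NH)
                 for x in range(NX) for e in range(NE) for a in range(NA)}

# Coordinates of each outcome: 0 = X, 1 = A, 2 = S_e, 3 = S_d, 4 = X_hat.
joint = {}
for x, a, e, d, h in itertools.product(
        range(NX), range(NA), range(NE), range(ND), range(NH)):
    joint[(x, a, e, d, h)] = (p_x[x] * p_a_given_x[x][a]
                              * p_s_given_xa[(x, a)][e * ND + d]
                              * p_h_given_xea[(x, e, a)][h])

X, A, SE, SD, XH = range(5)
rate_1 = (cond_mi(joint, [X], [A]) + cond_mi(joint, [XH], [X, SE], [A])
          - cond_mi(joint, [XH], [SD], [A]))
rate_2 = cond_mi(joint, [X], [A]) + cond_mi(joint, [XH], [X, SE], [A, SD])
print(rate_1, rate_2)  # the two values agree up to floating-point error
```

The agreement relies on the Markov chain $\hat{X} - (X, A, S_e) - S_d$, which holds by construction because $\hat{X}$ is generated from $(X, S_e, A)$ only.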

Lemma 1: The rate-distortion-cost function $R_{ac,cr}(D, C)$ given in (1) and (2) is a non-increasing convex function
of distortion $D$ and cost $C$.

Proof: The proof is given in Appendix A. ■

B. Other Results 

In the following, we provide some connecting conclusions which help develop our understanding and also relate 
our main result to other known results in the literature. We consider the case where the common reconstruction
constraint is omitted and then our setting recovers the source coding with action-dependent SI setup of [7]. On 
the other hand, if we have no control over the SI, then our setting simply recovers source coding with common 
reconstruction [11]. We might also consider a special case where side information at the encoder or the decoder is 
absent. The result in this case can be derived straightforwardly by setting the SI to be a constant value. 

Proposition 1: When the additional CR constraint is omitted, the rate-distortion-cost function for the source with
action-dependent partial side information available at the encoder and the decoder (no CR) is given by

$$R_{ac}(D, C) = \min\left[I(X; A) + I(U; X, S_e | A) - I(U; S_d | A)\right], \quad (3)$$

where the joint distribution of $(X, A, S_e, S_d, U)$ is of the form

$$P_X(x) P_{A|X}(a|x) P_{S_e, S_d | X, A}(s_e, s_d | x, a) P_{U | X, S_e, A}(u | x, s_e, a)$$

and the minimization is over all $P_{A|X}$, $P_{U|X,S_e,A}$ and $g : \mathcal{U} \times \mathcal{S}_d \to \hat{\mathcal{X}}$ subject to

$$E[d(X, g(U, S_d))] \le D, \quad E[\Lambda(A)] \le C,$$

and $U$ is the auxiliary random variable with $|\mathcal{U}| \le |\mathcal{A}||\mathcal{X}| + 3$.

Proof: The rate-distortion-cost function in this case can be derived along the lines of Theorem 1. The
achievability proof is a straightforward modification of that of Theorem 1 where the codeword $U^n$ is used instead
of $\hat{X}^n$ and the decoding function $g$ is introduced (similarly as in the Wyner-Ziv problem). The converse proof is
given in Appendix C. ■

Corollary 1: For a special case where the side information at the encoder is absent, the rate-distortion-cost
function for the source with action-dependent side information available at the decoder (and CR constraint) can be
derived as a special case of Proposition 1 (Theorem 1) by setting $S_e$ to a constant value.

Corollary 2: For the case where the side information at the encoder is absent and we have no control over the
SI at the decoder, i.e., the action alphabet size is one, the rate-distortion function for the source with CR constraint
is given by

$$R_{cr}(D) = \min\left[I(X; \hat{X} | S_d)\right], \quad (4)$$

where the joint distribution of $(X, S_d, \hat{X})$ is of the form

$$P_X(x) P_{S_d|X}(s_d|x) P_{\hat{X}|X}(\hat{x}|x)$$

and the minimization is over all $P_{\hat{X}|X}$ subject to $E[d(X, \hat{X})] \le D$. Note that this result recovers Theorem 1 in
Steinberg's coding and common reconstruction [11].

Since the action sequence is taken based on a rate-limited link which is part of the total rate from the encoder
to the decoder (see Fig. 1), in some cases, we might be interested in characterizing the individual rate constraint in
the form of a rate region. Here we consider the same setting as in Fig. 1, but we assume that the rate on the link
used for generating the action sequence is denoted by $R_1$ and the remaining rate from the encoder to the decoder
is denoted by $R_2$.

Corollary 3: The rate-distortion-cost region is given by the set of all $(R_1, R_2, D, C)$ satisfying

$$R_1 \ge I(X; A)$$
$$R_1 + R_2 \ge I(X; A) + I(\hat{X}; X, S_e | A, S_d)$$
$$D \ge E[d(X, \hat{X})]$$
$$C \ge E[\Lambda(A)],$$

where the joint distribution of $(X, A, S_e, S_d, \hat{X})$ is of the form

$$P_X(x) P_{A|X}(a|x) P_{S_e, S_d | X, A}(s_e, s_d | x, a) P_{\hat{X} | X, S_e, A}(\hat{x} | x, s_e, a).$$

Note that the result is related to the successive refinement rate-distortion region where we might consider the action
sequence as a reconstruction sequence in the first stage, and the refinement stage involves the side information
available at the encoder and the decoder $(S_e, S_d)$. We also note that the rate-distortion-cost function in Theorem 1
is simply a constraint on the total rate $R = R_1 + R_2$ for a given distortion $D$ and cost $C$.

Proof: The proof is a modification of that of Theorem 1 where we consider instead the individual rate
constraints. More specifically, the achievable scheme of Theorem 1 is modified so that the index $W_1$ is split
into two independent parts $(W_{1,1}, W_{1,2})$, and the action sequence is selected based on only $W_{1,1}$. In the converse,
the sum-rate constraint is the same as in the converse proof of Theorem 1, while the constraint on $R_1$ can be
derived straightforwardly using the techniques from point-to-point lossy source coding. ■

C. Binary Example 

We will show an example of the rate-distortion-cost function for the special case considered in Corollary 1 where 
the SI at the encoder is absent. Our example is a combination of examples in [7] and [11] which are based on 
the Wyner-Ziv example [3] and illustrate nicely the expected behavior of the rate-distortion function due to the 
implication of action-dependent side information with cost [7] and common reconstruction constraint [11].

We consider a given source and side information distribution $P_X$, $P_{S_d|X,A}$. We assume binary action $\mathcal{A} =
\{0, 1\}$ with $A = 1$ corresponding to observing the side information symbol and $A = 0$ to not observing it. We
assume that an observation has a unit cost, i.e., $\Lambda(A) = A$ and $E[\Lambda(A)] = P_A(1) = C$. We note that the second
mutual information term in (2) neglecting $S_e$ corresponds to the CR rate-distortion function [11, eq.(8)] conditioned
on $A$. Let $D_i$ be the contribution to the average distortion given $A = i$, $i = 0, 1$, i.e., $(1 - C)D_0 + CD_1 = D$.
Thus, the specialization of Theorem 1 for this case gives

$$R_{ac,cr}(D, C) = \min\; I(X; A) + (1 - C) \cdot R(P_{X|A=0}, D_0) + C \cdot R_{cr}(P_{X,S_d|A=1}, D_1), \quad (5)$$

where $R(P_X, D)$ denotes the rate-distortion function of the source $P_X$ without side information and $R_{cr}(P_{X,S_d}, D)$
denotes the CR rate-distortion function defined in [11] when source and side information are jointly distributed
according to $P_{X,S_d}$.
It is interesting to compare $R_{ac,cr}(D, C)$ to the rate-distortion-cost function of the case without the CR constraint
$R_{ac}(D, C)$ (a special case of (3) when neglecting $S_e$) to see how much we have to "pay" for satisfying the additional
CR constraint. In this case

$$R_{ac}(D, C) = \min\; I(X; A) + (1 - C) \cdot R(P_{X|A=0}, D_0) + C \cdot R_{wz}(P_{X,S_d|A=1}, D_1), \quad (6)$$

where $R_{wz}(P_{X,S_d}, D)$ denotes the Wyner-Ziv rate-distortion function when source and side information are jointly
distributed according to $P_{X,S_d}$. We note that the difference between (5) and (6) is only in their last terms.

Let us consider a binary symmetric source, a binary reconstruction, and a symmetric side information channel
when actions are taken to observe the side information. That is, $\mathcal{X} = \hat{\mathcal{X}} = \mathcal{S}_d = \{0, 1\}$, where $X$ is distributed
according to Bernoulli(1/2), and the side information $S_d$ is given as an output of a binary symmetric channel with
input $X$ and crossover probability $p_0$ when $A = 1$. The Hamming distance is considered as the distortion measure.
In [11, Example 1] the author computes the CR rate-distortion function for this source,

$$R_{cr}(P_{X,S_d|A=1}, D) = h(p_0 \star D) - h(D), \quad 0 \le D \le 1/2,$$

where $h(\cdot)$ is the binary entropy function and $p_0 \star D = p_0(1 - D) + (1 - p_0)D$. As known from [3] the Wyner-Ziv
rate-distortion function for this source is given by

$$R_{wz}(P_{X,S_d|A=1}, D) = \inf_{\theta, \beta}\left[\theta\left(h(p_0 \star \beta) - h(\beta)\right)\right],$$

for $0 \le D \le p_0$, where the infimum is with respect to all $\theta$, $\beta$, where $0 \le \theta \le 1$ and $0 \le \beta \le p_0$ such that
$D = \theta\beta + (1 - \theta)p_0$. In addition, we know that $R(P_{X|A=0}, D) = 1 - h(D)$ for this source [27].

Fig. 3. Rate-distortion curves for the binary symmetric source with common reconstruction and action-dependent side information available
at the decoder. The markers × correspond to the cases with CR constraint; the markers □ correspond to the cases without the CR constraint.
The different line styles correspond to different costs (dotted C = 0, dashed-dotted C = 1/2, and solid C = 1).

Using these results, we can compute (5) and (6), and compare $R_{ac,cr}(D, C)$ and $R_{ac}(D, C)$ to illustrate the
consequences of enforcing the CR constraint. For a given $C = 0$, $1/2$, and $1$, and $p_0 = 1/4$, we plot the rate-distortion
tradeoffs in Fig. 3. The plot shows that there is a rate penalty when the CR constraint is required. This
penalty changes according to the action cost, as shown by the gap between $R_{ac,cr}(D, C)$ and $R_{ac}(D, C)$ for different
costs. Also, with the additional CR constraint, there is a tradeoff between the action cost used for generating $S_d^n$
and the minimum rate one can compress to achieve a desired distortion level. That is, "spending" too much on
generating the SI for the decoder can negatively influence the common reconstruction capability of the encoder,
and thus affect the minimum rate required to compress the source.
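The component rate-distortion functions of this example are simple to evaluate numerically. The sketch below implements the three closed-form or one-parameter expressions quoted above for the binary symmetric source; the function names and the grid resolution are hypothetical, and the code does not carry out the outer minimization in (5) and (6). It therefore only covers the endpoint costs $C = 0$ (no side information, rate $1 - h(D)$) and $C = 1$ (side information always observed, so $I(X;A) = 0$), under the assumption $0 < p_0 \le 1/2$.

```python
import math


def h(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def star(p, q):
    """Binary convolution p * q = p(1 - q) + (1 - p)q."""
    return p * (1 - q) + (1 - p) * q


def rate_no_si(D):
    """R(D) = 1 - h(D) for the Bernoulli(1/2) source, 0 <= D <= 1/2."""
    return max(0.0, 1.0 - h(min(D, 0.5)))


def rate_cr(p0, D):
    """CR rate-distortion function h(p0 * D) - h(D), 0 <= D <= 1/2."""
    return h(star(p0, D)) - h(D)


def rate_wz(p0, D, grid=2000):
    """Wyner-Ziv function: grid search over beta in [0, D], with theta fixed
    by the constraint D = theta * beta + (1 - theta) * p0."""
    if D >= p0:
        return 0.0
    best = float("inf")
    for k in range(grid + 1):
        beta = D * k / grid
        theta = (p0 - D) / (p0 - beta)  # lies in (0, 1] since beta <= D < p0
        best = min(best, theta * (h(star(p0, beta)) - h(beta)))
    return best


p0 = 0.25
for D in (0.0, 0.05, 0.1, 0.2):
    # Columns: D, rate at C = 0, rate at C = 1 with CR, rate at C = 1 without CR.
    print(D, rate_no_si(D), rate_cr(p0, D), rate_wz(p0, D))
```

For distortions between $p_0$ and $1/2$ the Wyner-Ziv rate is zero while the CR rate is still positive, since the encoder cannot reproduce a reconstruction that depends on side information it does not observe.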
Fig. 4. Channel with action-dependent state information and reversible channel input.



III. Channel Coding with Action-dependent State and Reversible Input 

In this section, we consider channel coding with action-dependent state, where the state is known partially 
and noncausally at the encoder and the decoder as depicted in Fig. 4. In addition to decoding the message, 
the channel input X" is reconstructed with arbitrarily small error probability at the decoder. The corresponding 
reconstructed signal is termed reversible input. This setup captures the idea of simultaneously transmitting both 
the message and channel input sequence reliably over the channel. Our setup can be considered as a combination 
of Weissman's channel with action-dependent state [6], and Sumszyk and Steinberg's information embedding with 
reversible stegotext [13]. It is also closely related to the problems of reversible information embedding [17] and 
state amplification [16]. 

In the following, we present the problem formulation, characterize the main result which is the capacity of 
a discrete memoryless channel, and also present some other related results. The channel capacity is given as a 
solution to a constrained optimization problem with a constraint on the set of input distributions. We term this 
constraint the two-stage coding condition since it arises essentially from the two-stage structure of the encoding 
as well as the additional reconstruction constraint of a signal generated in the second stage. Also, we show in one 
example that such a constraint can be active in some cases, i.e., it actively restricts the set of capacity achieving 
input distributions, and when it is active, it will be satisfied with equality. This two-stage coding condition will be
discussed further in Section IV. 

A. Problem Formulation and Main Results 

Let $n$ denote the block length and $\mathcal{A}$, $\mathcal{S}_e$, $\mathcal{S}_d$, $\mathcal{X}$, and $\mathcal{Y}$ be finite sets. The system consists of two encoders,
namely, an action encoder and a channel encoder, and one decoder. A message $M$ chosen uniformly from the set
$\mathcal{M}^{(n)} = \{1, 2, \ldots, |\mathcal{M}^{(n)}|\}$ is given to both encoders. An action sequence $A^n$ is chosen based on the message
$M$ and is the input to the state information channel, described by a triple $(\mathcal{A}, P_{S_e,S_d|A}, \mathcal{S}_e \times \mathcal{S}_d)$, where $\mathcal{A}$ is the
action alphabet, $\mathcal{S}_e$ and $\mathcal{S}_d$ are the state alphabets, and $P_{S_e,S_d|A}$ is the transition probability from $\mathcal{A}$ to $(\mathcal{S}_e \times \mathcal{S}_d)$.
The channel state $S^n = (S_e^n, S_d^n)$ is mapped to the partial state information for the encoder and the decoder by
the mappings $l_e^{(n)}(S_e^n, S_d^n) = S_e^n$ and $l_d^{(n)}(S_e^n, S_d^n) = S_d^n$. The input to the state-dependent channel is denoted by
$X^n$. This channel is described by a quadruple $(\mathcal{X}, \mathcal{S}_e \times \mathcal{S}_d, P_{Y|X,S_e,S_d}, \mathcal{Y})$, where $\mathcal{X}$ is the input alphabet, $\mathcal{Y}$ is the
output alphabet and $P_{Y|X,S_e,S_d}$ is the transition probability from $(\mathcal{X} \times \mathcal{S}_e \times \mathcal{S}_d)$ to $\mathcal{Y}$. The decoder, which might
be considered as two separate decoders, i.e., a message decoder and a channel input decoder, decodes the message
and the channel input based on channel output $Y^n$ and state information $S_d^n$. We assume that both state information
and state-dependent channels are discrete memoryless and used without feedback with transition probabilities,

$$P_{S_e^n, S_d^n | A^n}(s_e^n, s_d^n | a^n) = \prod_{i=1}^{n} P_{S_e, S_d | A}(s_{e,i}, s_{d,i} | a_i),$$

$$P_{Y^n | X^n, S_e^n, S_d^n}(y^n | x^n, s_e^n, s_d^n) = \prod_{i=1}^{n} P_{Y | X, S_e, S_d}(y_i | x_i, s_{e,i}, s_{d,i}).$$

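Definition 3: An $(|\mathcal{M}^{(n)}|, n)$ code for the channels $P_{S_e,S_d|A}$ and $P_{Y|X,S_e,S_d}$ consists of the following functions:
an action encoder

$$f_a^{(n)} : \mathcal{M}^{(n)} \to \mathcal{A}^n,$$

a channel encoder

$$f^{(n)} : \mathcal{M}^{(n)} \times \mathcal{S}_e^n \to \mathcal{X}^n,$$

a message decoder

$$g_m^{(n)} : \mathcal{Y}^n \times \mathcal{S}_d^n \to \mathcal{M}^{(n)},$$

and a channel input decoder

$$g_x^{(n)} : \mathcal{Y}^n \times \mathcal{S}_d^n \to \mathcal{X}^n.$$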

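The average probabilities of error in decoding the message $M$ and the channel input $X^n$ are defined by

$$P_{m,e}^{(n)} = \Pr(\hat{M} \neq M), \quad P_{x,e}^{(n)} = \Pr(\hat{X}^n \neq X^n),$$

where $\hat{M}$ and $\hat{X}^n$ denote the outputs of the message decoder and the channel input decoder, respectively.
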
Definition 4: A rate $R$ is said to be achievable if for any $\delta > 0$ there exists for all sufficiently large $n$ an
$(|\mathcal{M}^{(n)}|, n)$-code such that $\frac{1}{n} \log |\mathcal{M}^{(n)}| \ge R - \delta$, $P_{m,e}^{(n)} \le \delta$, and $P_{x,e}^{(n)} \le \delta$. The capacity of the channel is the
supremum of all achievable rates.

Theorem 2: The capacity of channels with action-dependent state available noncausally to the encoder and the
decoder and with reversible input at the decoder shown in Fig. 4 is given by

$$C = \max[I(A, X; Y, S_d) - I(X; S_e|A)], \tag{7}$$

where the joint distribution of $(A, S_e, S_d, X, Y)$ is of the form

$$P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$$

and the maximization is over all $P_A$ and $P_{X|A,S_e}$ such that

$$0 \leq I(X; Y, S_d|A) - I(X; S_e|A). \tag{8}$$
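
The relation between the objective in (7) and the quantity constrained in (8) follows from the chain rule for mutual information. The shorthand $\Delta I$ below is used only for illustration:

```latex
\begin{align*}
I(A,X;Y,S_d) - I(X;S_e|A)
  &= I(A;Y,S_d) + I(X;Y,S_d|A) - I(X;S_e|A) \\
  &= I(A;Y,S_d) + \Delta I,
  \qquad \Delta I \triangleq I(X;Y,S_d|A) - I(X;S_e|A).
\end{align*}
```

In this form, (8) reads $\Delta I \geq 0$: the rate contributed on top of the action codeword cannot be negative.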

Fig. 5. Binning for the achievability: for each codeword $a^n$, a codebook $\{x^n\}$ of size $2^{n(I(X;Y,S_d|A)-\delta_\epsilon)}$ is generated i.i.d. each $\sim P_{X|A}$.
Then they are distributed uniformly into $2^{n(I(X;Y,S_d|A)-I(X;S_e|A)-2\delta_\epsilon)}$ equal-sized bins.



Proof: We prove achievability by showing that any rate $R < C$ is achievable, i.e., for any $\delta > 0$, there exists
for all sufficiently large $n$ an $(|\mathcal{M}^{(n)}|, n)$ code with $\frac{1}{n}\log|\mathcal{M}^{(n)}| \geq R - \delta$, and average probabilities of error
$P_{m,e}^{(n)} \leq \delta$ and $P_{x,e}^{(n)} \leq \delta$. The proof of achievability uses random coding and joint typicality decoding. Conversely,
we show that given any sequence of $(|\mathcal{M}^{(n)}|, n)$ codes with $\frac{1}{n}\log|\mathcal{M}^{(n)}| \geq R - \delta_n$, $P_{m,e}^{(n)} \leq \delta_n$, and $P_{x,e}^{(n)} \leq \delta_n$,
then $R \leq C$. The proof of the converse uses Fano's inequality and properties of the entropy function.

The achievability proof follows arguments in [6] with a modification in which we use the channel input codeword
$x^n$ directly instead of the auxiliary codeword. In the following, we give a sketch of the achievability proof.
An action codebook $\{a^n\}$ of size $2^{n(I(A;Y,S_d)-\delta_\epsilon)}$ is generated i.i.d. $\sim P_A$. For each $a^n$, another codebook
$\{x^n\}$ of size $2^{n(I(X;Y,S_d|A)-\delta_\epsilon)}$ is generated i.i.d. $\sim P_{X|A}$. Then the codewords are distributed uniformly into
$2^{n(I(X;Y,S_d|A)-I(X;S_e|A)-2\delta_\epsilon)}$ equal-sized bins (see Fig. 5). Given the message $m = (m_1, m_2)$, the action codeword
$a^n(m_1)$ is selected. Then the channel states $(s_e^n, s_d^n)$ are generated as an output of the memoryless channel
with transition probability $P_{S_e^n,S_d^n|A^n}(s_e^n, s_d^n|a^n) = \prod_{i=1}^{n} P_{S_e,S_d|A}(s_{e,i}, s_{d,i}|a_i)$. The encoder looks for $x^n$ that
corresponds to $m_1$ and is in the bin $m_2$ such that it is jointly typical with the selected $a^n$ and $s_e^n$. For sufficiently
large $n$, with arbitrarily high probability, there exists such a codeword because there are approximately
$2^{n(I(X;S_e|A)+\delta_\epsilon)}$ codewords in the bin. Then the selected $x^n$ is transmitted over the channel $P_{Y|X,S_e,S_d}$. Given $y^n$
and $s_d^n$, the decoder in the first step looks for the codeword $a^n$ that is jointly typical with $y^n$ and $s_d^n$. With high
probability, it will find one and it is the one chosen by the encoder since the codebook size is $2^{n(I(A;Y,S_d)-\delta_\epsilon)}$.
Then, given the correctly decoded $m_1$, the decoder in the second step looks for $x^n$ that is jointly typical with
$y^n$, $s_d^n$, and $a^n$. Again, with high probability, it will find one and it is the one chosen by the encoder since
the size of the codebook is $2^{n(I(X;Y,S_d|A)-\delta_\epsilon)}$. The corresponding bin index is then decoded as $m_2$. In total,
$I(A; Y, S_d) + I(X; Y, S_d|A) - I(X; S_e|A) - 3\delta_\epsilon$ bits per channel use can be used to transmit the message $m$ such
that both $m$ and $x^n$ are decoded correctly at the decoder. Note that the above coding scheme, which splits the
message into two parts and decodes them sequentially, works successfully when we have a proper positive number
of bins for codewords $x^n$, i.e., $I(X; Y, S_d|A) - I(X; S_e|A) - 2\delta_\epsilon > 0$. The more detailed achievability proof and
converse proof are given in Appendix D. ■
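
The rate bookkeeping of this sketch can be checked with a few lines of code. This is a minimal illustration, assuming the three mutual information values are already known; the function name `binning_rates` and the returned dictionary keys are hypothetical names introduced here.

```python
def binning_rates(i_a_ysd, i_x_ysd_given_a, i_x_se_given_a, delta_eps):
    """Exponents (bits per channel use) of the codebooks in the sketch.

    i_a_ysd         : I(A; Y, S_d)
    i_x_ysd_given_a : I(X; Y, S_d | A)
    i_x_se_given_a  : I(X; S_e | A)
    delta_eps       : the small slack delta_epsilon
    """
    action_rate = i_a_ysd - delta_eps                # size of {a^n}
    x_codebook_rate = i_x_ysd_given_a - delta_eps    # size of {x^n} per a^n
    bin_rate = i_x_ysd_given_a - i_x_se_given_a - 2 * delta_eps  # number of bins
    if bin_rate <= 0:
        # No positive number of bins: the two-stage scheme does not apply.
        raise ValueError("two-stage coding condition violated for this slack")
    return {
        "action_rate": action_rate,
        "x_codebook_rate": x_codebook_rate,
        "bin_rate": bin_rate,
        # Codewords per bin: I(X; S_e | A) + delta_eps.
        "per_bin_rate": x_codebook_rate - bin_rate,
        # Message rate: m1 via the action codeword, m2 via the bin index.
        "message_rate": action_rate + bin_rate,
    }


rates = binning_rates(0.5, 0.4, 0.1, 0.01)
```

The message rate returned here equals $I(A;Y,S_d) + I(X;Y,S_d|A) - I(X;S_e|A) - 3\delta_\epsilon$, in agreement with the total stated in the sketch.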

We term the condition $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ which appears in Theorem 2 the two-stage coding
condition since it represents the underlying sufficient condition for successful two-stage coding. It plays a role
in restricting the set of input distributions in the capacity expression. It is also natural to wonder whether the
two-stage coding condition can really be active or is always inactive when computing the capacity. In Example 1,
Subsection C, we show by example that there exists a case where the condition is active. In the following results
we also show that if the condition is active, then it is satisfied with equality, i.e., the capacity is obtained with
$I(X; Y, S_d|A) - I(X; S_e|A) = 0$. More details on the two-stage coding condition and its connection to other related
problems will be given in Section IV.

Fig. 6. (a) the corresponding region with $\Delta I \geq 0$, (b) the corresponding region with $\Delta I < 0$.

Remark 2: It is possible to consider an action symbol as another input to the memoryless channel $P_{Y|X,S_e,S_d}$.
The capacity expression for this more general channel $P_{Y|X,S_e,S_d,A}$ remains unchanged. This can be shown by
defining the new state $S_e' = (S_e, A)$ and then applying the characterization in Theorem 2.

Proposition 2: If the two-stage coding condition is ignored, and the solution to the unconstrained problem in
(7) results in $I(X; Y, S_d|A) - I(X; S_e|A) < 0$ (the two-stage coding condition is active), then the actual channel
capacity will be obtained with $I(X; Y, S_d|A) - I(X; S_e|A) = 0$.

Proof: We consider a set $\mathcal{R}_{\mathrm{mod}}$ containing pairs of rate $R$ and dummy variable $\tilde{R} \in \mathbb{R}$ introduced for the
two-stage coding condition, i.e.,

$$\begin{aligned}
\mathcal{R}_{\mathrm{mod}} = \{(R, \tilde{R}) : \; & 0 \leq R \leq I(A, X; Y, S_d) - I(X; S_e|A) \triangleq I(A; Y, S_d) + \Delta I, \\
& \tilde{R} \leq I(X; Y, S_d|A) - I(X; S_e|A) \triangleq \Delta I \\
& \text{for some } P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)\}.
\end{aligned}$$

For each $P_A \in \mathcal{P}_A$, $P_{X|A,S_e} \in \mathcal{P}_{X|A,S_e}$, compute a tuple $(I(A; Y, S_d) + \Delta I, \Delta I)$, and obtain the
corresponding region as shown in Fig. 6. We can show that the region $\mathcal{R}_{\mathrm{mod}}$ is convex (see Appendix E). Then, to
evaluate the region $\mathcal{R}_{\mathrm{mod}}$, we find the union of all regions obtained from all possible $P_A \in \mathcal{P}_A$, $P_{X|A,S_e} \in \mathcal{P}_{X|A,S_e}$.
Our main task is to compute the channel capacity, so we are interested in finding the maximum rate $R$ under the
feasible value of $\Delta I$, i.e., $\Delta I \geq 0$. Since $\mathcal{R}_{\mathrm{mod}}$ is convex, one can show that there are only two possible shapes of the
region $\mathcal{R}_{\mathrm{mod}}$, i.e., the ones where the maximum of $R$ is obtained with non-negative and negative $\Delta I$, respectively.
This is depicted in Fig. 7. Case (b) in Fig. 7, which is the case where the two-stage coding condition is active,
is of interest here. Since the feasible solutions have to satisfy $\Delta I \geq 0$, we can conclude that when the two-stage
coding condition is active, the channel capacity will be obtained with $\Delta I = 0$.
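
The last step can be made concrete with a convex-combination argument. The following is an illustrative sketch, not the argument of Appendix E; the points $(R_1, \Delta_1)$, $(R_2, \Delta_2)$ and the weight $\lambda$ are names introduced here.

```latex
% Assumptions: (R_1, \Delta_1) and (R_2, \Delta_2) both lie in the convex region,
% with \Delta_1 < 0 \le \Delta_2 and R_1 \ge R_2 (R_1 is the unconstrained maximum).
\begin{align*}
\lambda &= \frac{\Delta_2}{\Delta_2 - \Delta_1} \in [0, 1), \\
\lambda \Delta_1 + (1 - \lambda) \Delta_2
  &= \frac{\Delta_2 \Delta_1 - \Delta_1 \Delta_2}{\Delta_2 - \Delta_1} = 0, \\
\lambda R_1 + (1 - \lambda) R_2 &\geq R_2 .
\end{align*}
```

By convexity the point $(\lambda R_1 + (1-\lambda) R_2, 0)$ belongs to the region, so any feasible point with a strictly positive second coordinate is matched or exceeded in rate by a point whose second coordinate is zero.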

Fig. 7. (a) the region $\mathcal{R}_{\mathrm{mod}}$ where the maximum $R$ is achieved with $\Delta I \geq 0$, (b) the region $\mathcal{R}_{\mathrm{mod}}$ where the maximum $R$ is achieved with $\Delta I < 0$.


B. Other Results 

In the following, we provide some conclusions which help develop our understanding and also relate our main 
result to other known results in the literature. We consider the case where the reversible input constraint is omitted 
and then our setting recovers Weissman's channel with action-dependent state [6]. On the other hand, if the channel 
state sequences are given by nature, i.i.d. according to some distribution, then our setting simply recovers the special 
case of information embedding with reversible stegotext [13]. We also consider the special case where channel state 
information at the encoder or the decoder is absent. The result in this case can be derived straightforwardly by setting 
the channel state variable to a constant value. Lastly, it is also natural to consider the case where the decoder is 
interested in decoding the message and the encoder's state information instead. By this, the channel input sequence 
can be retrieved based on the decoded message, the encoder's state information, and a known deterministic encoding 
function. We show that if the objective is to decode only the message and the channel input, then decoding the 
message and encoder's state information first, and then re-encoding the channel input is suboptimal. 

Proposition 3: When the reversible input constraint is omitted, the capacity of the channel with action-dependent
state available noncausally to the encoder and the decoder is given by

$$C_m = \max[I(A, U; Y, S_d) - I(U; S_e|A)], \tag{9}$$

where the joint distribution of $(A, S_e, S_d, U, X, Y)$ is of the form

$$P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{U|A,S_e}(u|a, s_e)\mathbb{1}_{\{x=f(u,s_e)\}}P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$$

and the maximization is over $P_A$, $P_{U|A,S_e}$, and $f: \mathcal{U} \times \mathcal{S}_e \to \mathcal{X}$, and $U$ is the auxiliary random variable with
$|\mathcal{U}| \leq |\mathcal{A}||\mathcal{S}_e||\mathcal{X}| + 1$.

Proof: The proof follows from arguments in [6] with modifications such that the state $S^n = (S_e^n, S_d^n)$, and
$(Y^n, S_d^n)$ are considered as the new channel output, and the set of distributions is restricted to satisfy the Markov
relations $U - (A, S_e) - S_d$ and $X - (U, S_e) - (A, S_d)$. ■

Corollary 4: For the special case where the state information at the decoder is absent, the capacity of the channel
is given as a special case of Theorem 2 by setting $S_d$ to a constant value.

Corollary 5: For the case where the state information at the decoder is absent and the channel state is given by
nature, i.e., the action alphabet size is one, the capacity of the channel is obtained as

$$C_{\mathrm{stegotext}} = \max[I(X; Y) - I(X; S_e)], \tag{10}$$

where the joint distribution of $(S_e, X, Y)$ is of the form

$$P_{S_e}(s_e)P_{X|S_e}(x|s_e)P_{Y|X,S_e}(y|x, s_e)$$

and the maximization is over all $P_{X|S_e}$. Note that this recovers a special case of the results on information embedding
with reversible stegotext [13] when there is no distortion constraint between $X^n$ and $S_e^n$.

Next we look at a related problem which later helps us interpret the two-stage coding condition. We
consider a new and slightly different communication problem where the decoder is instead interested in decoding
the message $M$ and the state $S_e^n$. Due to a deterministic encoding function, the channel input signal can be
retrieved based on the decoded message and the encoder's state information. This communication problem has a
more demanding reconstruction constraint than our main problem considered in Fig. 4 since it essentially requires
that the decoder can decode the message, the encoder's state, and the channel input signal, all reliably.

Proposition 4: Consider a new communication problem which is slightly different from the one considered in
Fig. 4 in that the decoder is interested in decoding the message $M$ and the state $S_e^n$ reliably. The capacity of such
a channel is given by

$$C_{S_e} = \max[I(A, S_e, X; Y, S_d) - H(S_e|A)], \tag{11}$$

where the joint distribution of $(A, S_e, S_d, X, Y)$ is of the form

$$P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$$

and the maximization is over all $P_A$ and $P_{X|A,S_e}$ such that

$$0 \leq I(S_e, X; Y, S_d|A) - H(S_e|A). \tag{12}$$

Proof: Since decoding $M$ and $S_e^n$ implies that $X^n$ is also decoded from the deterministic encoding function,
one can substitute $(S_e, X)$ in place of $X$ in Theorem 2 and obtain the capacity. More specifically, the achievable
scheme in this case is different from the previous case of decoding $M$ and $X^n$ in that the SI codebook is introduced
and it has to "cover" all possible generated $S_e^n$ losslessly. That is, the size of the SI codebook should be sufficiently
large so that the encoder is able to find an exact $S_e^n$ from the codebook. Similarly to Theorem 2, the capacity
expression also has a restricting condition $0 \leq I(S_e, X; Y, S_d|A) - H(S_e|A)$ on the set of input
distributions. Besides the rate constraint, this condition can be considered as a necessary and sufficient condition
for the process of losslessly compressing $S_e^n$ through $X^n$ and then transmitting it reliably over the channel in our
two-stage communication problem. The detailed achievability proof and the converse proof are given in Appendix F.

Remark 3: We know that the channel input sequence can be retrieved based on the decoded message, the encoder's
state information, and a known deterministic encoding function. Therefore, it is natural to compare the capacity $C$
in Theorem 2 with $C_{S_e}$ in Proposition 4. For a given channel $P_{S_e,S_d|A}$, $P_{Y|X,S_e,S_d}$, we have that $C \geq C_{S_e}$.

Proof: One can show that $I(S_e, X; Y, S_d|A) - H(S_e|A) \leq I(X; Y, S_d|A) - I(X; S_e|A)$ for all joint distributions
factorized in the form of $P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$. This implies that $C_{S_e}$ is
evaluated over a smaller set than that of $C$. In addition, one can show in a similar fashion that $I(A, S_e, X; Y, S_d) -
H(S_e|A) \leq I(A, X; Y, S_d) - I(X; S_e|A)$, and thus conclude that $C \geq C_{S_e}$. ■
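
The first inequality in this proof can be verified with the chain rule. The following short derivation is included for illustration and uses only standard identities:

```latex
\begin{align*}
I(S_e,X;Y,S_d|A) - H(S_e|A)
  &= I(X;Y,S_d|A) + I(S_e;Y,S_d|A,X) - I(X;S_e|A) - H(S_e|A,X) \\
  &= I(X;Y,S_d|A) - I(X;S_e|A) - H(S_e|A,X,Y,S_d) \\
  &\leq I(X;Y,S_d|A) - I(X;S_e|A).
\end{align*}
```

Adding $I(A;Y,S_d)$ to both sides and applying the chain rule once more gives the second inequality, so the gap between the two objectives is exactly $H(S_e|A,X,Y,S_d)$.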

We note that this new communication problem is closely related to the problems of state amplification [16],
and reversible information embedding [17]. The main difference is that, in our setting, channel states are generated
based on the action sequence. In [16] the decoder is interested in decoding the message reliably and in decoding the
encoder's state information within a list, while in [17], the decoder is interested in decoding both the message and
the encoder's state information reliably. The result in Remark 3 is also analogous to that in information embedding
with reversible stegotext [13] in which the authors show that if the objective is to decode only $M$ and $X^n$, then
decoding $M$ and $S_e^n$ first and re-encoding $X^n$ using a deterministic encoding function is suboptimal.

C. Examples 

In the following, we show two examples to illustrate the role of the two-stage coding condition in restricting a 
set of input distributions in the capacity expression. Example 1 shows that the two-stage coding condition can be 
active in computing the capacity, while Example 2 shows that there also exists a case where such a condition is 
not active at the optimal design. 

Example 1: Memory Cell With a Rewrite Option
For simplicity, let us consider a special case where $S_d$ is absent as in Corollary 4 and the channel is in the more
general form $P_{Y|X,S_e,A}$ as in Remark 2. We consider a binary example where $A, X, S_e, Y \in \{0, 1\}$, and the scenario
of writing on a memory cell with a rewrite option. The first writing is done through a binary symmetric channel
with crossover probability $\delta$ (BSC($\delta$)), input $A$, and output $S_e$. Then, assuming that there is perfect feedback of
the output $S_e$ to the second encoder, the second encoder has an option to rewrite on the memory or not to rewrite
(indicated by the value of $X$). If the rewrite value is $X = 1$, which corresponds to "rewrite," then $Y$ is given as the
output of BSC($\delta$) with input $A$ (rewrite using the old input). If $X = 0$, which corresponds to "no rewrite," we
simply get $Y = S_e$. In this case the decoder is interested in decoding both the embedded message and the rewrite
signal. See Fig. 8 for an illustration of this rewrite channel.

From Theorem 2 and Remark 2, we know that the capacity of this channel is given by

$$C = \max[I(A, X; Y) - I(X; S_e|A)], \tag{13}$$

where the joint distribution of $(A, S_e, X, Y)$ is of the form

$$P_A(a)P_{S_e|A}(s_e|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,A}(y|x, s_e, a)$$

and the maximization is over all $P_A$ and $P_{X|A,S_e}$ such that

$$0 \leq I(X; Y|A) - I(X; S_e|A). \tag{14}$$

Fig. 8. Two-stage writing on a memory cell with a rewrite option.

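Letting $A \sim \mathrm{Bernoulli}(p_a)$, and

$$P_{X|A,S_e}(0|0, 0) = p, \tag{15}$$
$$P_{X|A,S_e}(0|1, 0) = q, \tag{16}$$
$$P_{X|A,S_e}(0|0, 1) = r, \tag{17}$$
$$P_{X|A,S_e}(0|1, 1) = s. \tag{18}$$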

By straightforward manipulation, we get

$$H(Y) = h\big((1-\delta)(1-p_a)(1-\delta+\delta p) + \delta p_a(q+\delta-\delta q) + \delta(1-\delta)(1-r+rp_a-sp_a)\big),$$

$$H(Y|A) = (1-p_a)\,h\big((1-\delta)(p+(1-p)(1-\delta)+(1-r)\delta)\big) + p_a\,h\big(\delta(q+(1-q)\delta+(1-s)(1-\delta))\big),$$
-H{Y\A,X)-I{X-Se\A) 

+ [(1 - + (1 - - ^ - - w). 

then

$$C = \max_{p_a, p, q, r, s \in [0,1]} [H(Y) - H(Y|A, X) - I(X; S_e|A)]$$

subject to

$$0 \leq H(Y|A) - H(Y|A, X) - I(X; S_e|A).$$

By performing numerical optimization with $\delta = 0.1$, we obtain that the capacity of the channel equals 0.5310
bits per channel use. The optimal (capacity achieving) input distributions in this case are those in which $X - A - S_e$
forms a Markov chain, i.e., $p = r$, $q = s$, and in the end $p_a$ is the only remaining optimization variable. We note that
if we instead neglect the restriction on the maximization domain and solve the unconstrained optimization problem,
we would obtain the maximum value of 0.6690, which is strictly larger than the actual capacity. Therefore, this
example shows that there exists a case where the two-stage coding condition is active. In fact, the corresponding
two-stage coding condition in this case is satisfied with equality, as expected from Proposition 2.
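
The objective in (13) and the quantity in (14) can be evaluated numerically by brute-force enumeration of the joint distribution. The sketch below is an illustration only: the function names are hypothetical, and it assumes that $p, q, r, s$ denote $P_{X|A,S_e}(0|a,s_e)$ for $(a,s_e) = (0,0), (1,0), (0,1), (1,1)$ and that the rewrite noise is independent of the first write.

```python
import itertools
import math


def entropy(pmf, coords):
    """Entropy in bits of the marginal of `pmf` on the positions `coords`."""
    marginal = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in coords)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(v * math.log2(v) for v in marginal.values() if v > 0)


def rewrite_cell_pmf(pa, p, q, r, s, delta):
    """Joint pmf of (A, Se, X, Y) for the memory cell with a rewrite option.

    Assumed parametrization: p, q, r, s = P(X=0 | A=a, Se=se) for
    (a, se) = (0,0), (1,0), (0,1), (1,1).
    """
    px0 = {(0, 0): p, (1, 0): q, (0, 1): r, (1, 1): s}
    pmf = {}
    for a, se, x, y in itertools.product((0, 1), repeat=4):
        prob = pa if a == 1 else 1.0 - pa            # A ~ Bernoulli(pa)
        prob *= delta if se != a else 1.0 - delta    # first write: BSC(delta)
        prob *= px0[(a, se)] if x == 0 else 1.0 - px0[(a, se)]
        if x == 0:                                   # no rewrite: Y = Se
            prob *= 1.0 if y == se else 0.0
        else:                                        # rewrite: BSC(delta) with input A
            prob *= delta if y != a else 1.0 - delta
        if prob > 0:
            pmf[(a, se, x, y)] = prob
    return pmf


def rate_and_two_stage_gap(pa, p, q, r, s, delta):
    """Return (I(A,X;Y) - I(X;Se|A), I(X;Y|A) - I(X;Se|A))."""
    pmf = rewrite_cell_pmf(pa, p, q, r, s, delta)
    A, SE, X, Y = 0, 1, 2, 3

    def H(*coords):
        return entropy(pmf, coords)

    i_ax_y = H(A, X) + H(Y) - H(A, X, Y)
    i_x_se_a = H(A, X) + H(A, SE) - H(A, SE, X) - H(A)
    i_x_y_a = H(A, X) + H(A, Y) - H(A, X, Y) - H(A)
    return i_ax_y - i_x_se_a, i_x_y_a - i_x_se_a


# "Never rewrite" (X = 0 always) with a uniform action and delta = 0.1.
rate, gap = rate_and_two_stage_gap(0.5, 1.0, 1.0, 1.0, 1.0, 0.1)
```

For this particular input distribution the rate equals $1 - h(0.1) \approx 0.5310$ and the two-stage coding condition holds with equality, which agrees with the capacity value and the equality reported above. A generic optimizer over $(p_a, p, q, r, s)$ could be wrapped around `rate_and_two_stage_gap` to explore the constrained and unconstrained problems.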

Example 2: Inactive Two-stage Coding Condition
In other cases the two-stage coding condition in the capacity expression might not be active. One trivial example
is when $S_e - A - S_d$ forms a Markov chain for the action-dependent state channel $P_{S_e,S_d|A}$, and $Y - (X, S_d) - S_e$
forms a Markov chain for the state-dependent channel $P_{Y|X,S_e,S_d}$. In this case, it can be shown that for any joint
distribution $(P_A^{(1)}(a), P_{X|A,S_e}^{(1)}(x|a, s_e))$ such that $I(X; Y, S_d|A) - I(X; S_e|A) < 0$, there always exists another
joint distribution $(P_A^{(2)}(a), P_{X|A,S_e}^{(2)}(x|a, s_e))$ which satisfies $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ and achieves a higher
rate. One possible choice is to let $P_A^{(2)}(a) = P_A^{(1)}(a)$ and $P_{X|A,S_e}^{(2)}(x|a, s_e) = \sum_{\tilde{s}_e} P_{S_e|A}(\tilde{s}_e|a)P_{X|A,S_e}^{(1)}(x|a, \tilde{s}_e)$.
Consequently, the maximizing input distribution in this case will result in $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ and the
capacity of such a channel is given by $C = \max_{P_A, P_{X|A,S_e}} [I(A, X; Y, S_d) - I(X; S_e|A)]$.

IV. Discussion on the Two-stage Coding Condition and Formula Duality 

In this section we discuss in more detail the presence and impact of the two-stage coding condition in Theorem 2,
and we also consider the potential dual relations between the source coding and channel coding problems in Sections
II and III.

A. Two-stage Coding Condition 

1) Operational Coding View: As can be seen in our achievable scheme, the condition $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$
represents a tradeoff in the size of the codebook $\{x^n\}$ conditioned on the action sequence. Our coding scheme
involves random binning, and in order to encode/decode successfully we need to ensure that there is a proper positive
number of bins to satisfy both encoding and decoding requirements, based on joint typicality. More specifically, the
decoder is interested in decoding both the message (partly carried in the action codeword and partly as a bin index
of $x^n$) and the codeword $x^n$ itself. From the analysis of the error probability (see Appendix D), this additional
restriction on the number of bins ($I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$) arises in part from the error event where
only the message that is conveyed in the action codeword is decoded correctly, but not the codeword $x^n$. Since
the action sequence carries information about the same message that is carried by the codeword $x^n$, this additional
constraint is needed to ensure a vanishing probability of such an error event (see also (32), which shows that the two-stage coding
condition is the underlying constraint on the number of bins of codewords $x^n$). Conversely, we also see that for
any achievable rate, it is never possible to have a joint distribution that leads to $I(X; Y, S_d|A) - I(X; S_e|A) < 0$.

Fig. 9. Modified setting: a class of cooperative "multiple-access channel (MAC)" with common message.$^1$

The condition can also be interpreted based on the structure of the encoder, which involves two-stage coding 
(the action sequence is selected first, then the channel input is selected based on the action-dependent state). 
That is, the action sequence can be decoded in the first stage, which in turn results in an extra constraint for 
decoding the channel input in the second stage. Hence the condition describes a causality constraint imposed by
the communication problem. This observation might be interesting for some other problems as well. 

2) Source Coding View: We notice that the condition $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ can be equivalently
written as $H(X|Y, S_d, A) \leq H(X|S_e, A)$. Intuitively, this tells us that for reliable transmission of the channel input
signal over the channel given that the action is communicated, the uncertainty about $X$ that remains after observing
$Y$ and $S_d$ at the decoder should be less than the uncertainty of $X$ at the transmitter. Hence the two-stage coding
condition can, as a complement to the rate constraint, be considered as a necessary and sufficient condition for
reliable transmission of the description $X^n$ of the state $S_e^n$ through the channel in our two-stage communication
problem.
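
The equivalence between the two forms of the condition follows by expanding both conditional mutual information terms; the short derivation below is added for illustration:

```latex
\begin{align*}
I(X;Y,S_d|A) - I(X;S_e|A)
  &= \big[H(X|A) - H(X|Y,S_d,A)\big] - \big[H(X|A) - H(X|S_e,A)\big] \\
  &= H(X|S_e,A) - H(X|Y,S_d,A).
\end{align*}
```

Hence the left-hand side is non-negative exactly when $H(X|Y,S_d,A) \leq H(X|S_e,A)$.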

Alternatively, we note that in our case we do not need to reconstruct $S_e^n$ perfectly at the decoder, i.e., information
about $S_e^n$ conveyed through $X^n$ over the channel is needed only in part. We can write the condition as $I(X; S_e|A) \leq
I(X; Y, S_d|A)$ and interpret it as a condition for lossy transmission of $S_e^n$ through $X^n$ over the channel given that
$A^n$ is communicated. It is then natural to compare this to the case when we are interested in decoding $M$ and $S_e^n$,
e.g., as in Proposition 4. In that case, we want to reconstruct $S_e^n$ perfectly at the decoder; therefore, given $A^n$, the
necessary and sufficient condition for lossless transmission of $S_e^n$ through $X^n$ over the channel $P_{Y|X,S_e,S_d}$ is given
by $H(S_e|A) \leq I(X, S_e; Y, S_d|A)$.

3) Channel Coding View: We may also consider the condition $I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ from the point
of view of connecting it to a class of cooperative "multiple-access channels (MACs)" with common message.
Consider therefore a slightly modified setting shown in Fig. 9, where there is another independent message $W$
to be encoded at the channel encoder, and the message $M$ is a common message for both encoders. This setting
will reduce to our original problem when the rate of message $W$ is zero. From this point of view, the condition
$I(X; Y, S_d|A) - I(X; S_e|A) \geq 0$ is in fact a degenerate rate constraint derived from the underlying rate constraint
of message $W$ in the "MAC" setting.

B. Duality

In this work we notice the "dual" relations between input-output of elements in the source and channel coding
systems as depicted in Fig. 10. Similar dual relations also appear in other related problems, as listed below.

Wyner-Ziv's source coding (SC) (WZ, [3]) $\leftrightarrow$ Gel'fand-Pinsker's channel coding (CC) (GP, [4])
Permuter-Weissman's SC with action (PW, [7]) $\leftrightarrow$ Weissman's CC with action (W, [6])
Steinberg's SC with CR (S, [11]) $\leftrightarrow$ Sumszyk-Steinberg's CC with RI (SS, [13])
Section II $\leftrightarrow$ Section III

Fig. 10. Duality between the source coding with action-dependent side information and common reconstruction (Fig. 1) and channel coding
with action-dependent states and reversible input (Fig. 4).


As stated before in the introduction, we are interested in investigating formula duality of a set of problems [1].
Table I below summarizes the rate-distortion(-cost) function and the channel capacity expressions of the problems
of interest, neglecting the optimization variables (input probability distribution).

$^1$Based on this scenario, one can also recover special cases of results available for the multiple-access channel with common message. For
example, if the encoder state information $S_e$ is assumed to be a deterministic function of $A$, then this modified setting will reduce to a class
of MAC with common message and cribbing encoder, and eventually to a class of MAC with common message. Decoding both messages and
the channel input $X^n$ at the decoder is then equivalent to just decoding the messages $M$ and $W$.

TABLE I
Rate-distortion(-cost) function and channel capacity.

| Problems | Rate-distortion-cost function | Channel capacity |
|---|---|---|
| WZ and GP | $R_{WZ}(D) = \min[I(U;X) - I(U;S_d)]$ | $C_{GP} = \max[I(U;Y) - I(U;S_e)]$ |
| PW and W | $R_{PW}(D,C) = \min[I(A;X) + I(U;X \mid A) - I(U;S_d \mid A)]$ | $C_{W} = \max[I(A;Y) + I(U;Y \mid A) - I(U;S_e \mid A)]$ |
| S and SS | $R_{S}(D) = \min[I(\hat{X};X) - I(\hat{X};S_d)]$ | $C_{SS} = \max[I(X;Y) - I(X;S_e)]$ |
| Sec. II and III | $R(D,C) = \min[I(A;X) + I(\hat{X};X,S_e \mid A) - I(\hat{X};S_d \mid A)]$ | $C = \max_{p^*}[I(A;Y,S_d) + I(X;Y,S_d \mid A) - I(X;S_e \mid A)]$ |

As in [1] we can recognize the formula duality of the rate-distortion(-cost) function and the channel capacity by
the following correspondence,

Rate-distortion-cost $\leftrightarrow$ Channel capacity
minimization $\leftrightarrow$ maximization
$X$ (source symbol) $\leftrightarrow$ $Y$ (received symbol)
$\hat{X}$ (decoded symbol) $\leftrightarrow$ $X$ (transmitted symbol)
$S_e$ (state at the encoder) $\leftrightarrow$ $S_d$ (state at the decoder)
$S_d$ (state at the decoder) $\leftrightarrow$ $S_e$ (state at the encoder)
$U$ (auxiliary) $\leftrightarrow$ $U$ (auxiliary)
$A$ (action) $\leftrightarrow$ $A$ (action).

We see that the first three cases in Table I are obvious from the expressions of the rate-distortion(-cost) function
and the channel capacity, while the last duality (Secs. II and III) does not hold in general due to the fundamental
differences in the source and channel coding problems. We now give reasons based on the dual roles of the
encoder/decoder in the source coding problem and the decoder/encoder in the channel coding problem.

The first reason that the last duality does not hold in general is the presence of the two-sided side/state information.
That is, at the encoder in the source coding setup, the processing is sequential, i.e., the action-dependent side
information is generated first and then the side information $S_e^n$ is used in compressing the source sequence. However,
this sequential processing is not required in the decoding process of the channel coding problem since the state
information for both encoder and decoder is generated in the beginning, and both $Y^n$ and $S_d^n$ are available at
the decoder noncausally. The effect of this fundamental difference can be seen from the difference in the terms
$I(A; X)$ and $I(A; Y, S_d)$ in the rate-distortion-cost function and channel capacity expressions in Table I.

The second reason is the additional reconstruction constraint imposed on the two communication problems. First
consider the channel coding problem where we also require decoding of the channel input sequence (reversible
input constraint). In our problem the encoder has a causal structure; that is, $S_e^n$ is generated first, then followed
by $X^n$. When we require decoding of $X^n$, which is the signal generated in the second stage, the two-stage coding
condition, apart from the rate constraint, is necessary to ensure reliable transmission of the channel input $X^n$. In
fact, it plays a role in restricting the set of capacity achieving input distributions marked by $p^*$ in Table I. Now
we consider the source coding counterpart where we require the encoder to estimate the decoder's reconstruction
(common reconstruction constraint). Although there seems to be a similar two-stage structure in the decoder of this
setup, the two-stage coding condition is not relevant here. This is because the common reconstruction is performed
in the beginning at the encoder side and the identity of the action sequence is in fact known at both sides due to the
noiseless link between the encoder and the decoder.

V. Conclusion 

In this paper we studied a class of problems that extend Wyner-Ziv source coding and Gel'fand-Pinsker channel
coding with action-dependent side information. The extension involves having two-sided action-dependent partial SI,
and also enforcing additional reconstruction constraints. In the source coding problem, we derived the rate-distortion-cost
function for the memoryless source with two-sided action-dependent partial SI and common reconstruction,
while in the channel coding problem, the capacity of the discrete memoryless channel with two-sided action-dependent
state and reversible input is derived under the two-stage coding condition. In fact, this two-stage coding
condition arises from the additional reconstruction constraint and the causal structure of the setup, i.e., the channel
input signal to be reconstructed is generated in the second stage transmission. Besides the message rate constraint,
it can be considered as a necessary and sufficient condition for reliable transmission of the channel input signal over
the channel given that the action is communicated. An intuitive interpretation derived from its expression is that
uncertainty about the channel input remaining at the receiver after observing the channel output and the decoder's
state information should be less than that at the transmitter.

We were also interested in investigating the formula duality between the rate-distortion-cost function and channel
capacity of the source and channel coding problems. Although our extended problems seem to retain the dual
structure seen in the Wyner-Ziv and Gel'fand-Pinsker problems, they are not dual in general. In fact, there is an "operational
mismatch" caused by enforcing causality in parts of the system. For example, the two-sided SI in the source coding
problem requires a sequential encoding process, while in the channel coding problem the channel output and state
information are available noncausally to the decoder. Moreover, when we require additional reconstruction of the
channel input in the channel coding problem, the two-stage coding condition is needed due to the causal structure
of the encoder where the channel encoder has to wait for the state to be generated based on the action sequence.

We find it interesting to note that the two-stage coding condition which appears in the capacity expression can
be active, as shown in one example. This is, however, not surprising since the condition can also be seen as a
degenerate rate constraint of the underlying rate constraint in a cooperative MAC setup (see Section IV, part A).
We notice that by imposing an additional reconstruction constraint on that related problem, we are still able to
derive a closed-form solution. This leads us to believe that it might be possible to consider other (possibly open)
network information theory problems with additional reconstruction constraints, and be able to derive closed-form
solutions. In addition, if we obtain a similar two-stage coding condition in the solution, we might be able to
find a class of channels for which the capacity can be achieved with an input distribution that results in an inactive
two-stage coding condition. This can provide some insights into the role of the additional reconstruction constraint
in some communication channels, and should be considered as a topic for future work.

Appendix A 
Proof of Lemma 1 

Since the domain size of minimization in (1) or (2) increases with $D$ and $C$, $R_{ac,cr}(D, C)$ is non-increasing in
$D$ and $C$. For convexity, we consider two distinct points $(R_i, D_i, C_i)$, $i = 1, 2$, which lie on the boundary of
$R_{ac,cr}(D, C)$. Suppose $(P_{A|X}^{(i)}, P_{\hat{X}|X,S_e,A}^{(i)})$, $i = 1, 2$, achieve these respective points, i.e.,

$$R_i = R_{ac,cr}(D_i, C_i) = I^{(i)}(X; A) + I^{(i)}(\hat{X}; X, S_e|A, S_d), \quad i = 1, 2,$$

where $I^{(i)}(\cdot)$ denotes the mutual information associated with $P_{A|X}^{(i)}$ and $P_{\hat{X}|X,S_e,A}^{(i)}$.

Let $Q \in \{1, 2\}$ be a random variable independent of $X$ and conditionally independent of $(S_e, S_d)$ given $(X, A)$,
with $P_Q(1) = 1 - P_Q(2) = \lambda$, $0 \leq \lambda \leq 1$. Then we have the joint distribution

$$P_{Q,X,A,S_e,S_d,\hat{X}}(q, x, a, s_e, s_d, \hat{x}) = P_Q(q)P_X(x)P_{A|X,Q}(a|x, q)P_{S_e,S_d|X,A}(s_e, s_d|x, a)P_{\hat{X}|X,S_e,A,Q}(\hat{x}|x, s_e, a, q),$$

where $P_{A|X,Q}(a|x, q) = P_{A|X}^{(q)}(a|x)$ and $P_{\hat{X}|X,S_e,A,Q}(\hat{x}|x, s_e, a, q) = P_{\hat{X}|X,S_e,A}^{(q)}(\hat{x}|x, s_e, a)$, $q = 1, 2$.

Consider now the marginal distribution (averaged over $Q$)

$$P_{X,A,S_e,S_d,\hat{X}}(x, a, s_e, s_d, \hat{x}) = \sum_{q=1,2} P_Q(q)P_X(x)P_{A|X,Q}(a|x, q)P_{S_e,S_d|X,A}(s_e, s_d|x, a)P_{\hat{X}|X,S_e,A,Q}(\hat{x}|x, s_e, a, q),$$

which is associated with the sum of mutual information terms $I(X; A) + I(\hat{X}; X, S_e|A, S_d)$. It follows that

$$\begin{aligned}
& I(X; A) + I(\hat{X}; X, S_e|A, S_d) \\
&= I(X; A, \hat{X}, S_d) - I(X; S_d|A) + I(\hat{X}; S_e|X, A, S_d) \\
&= H(X) - H(X|A, \hat{X}, S_d) - H(S_d|A) + H(S_d|X, A) + H(S_e|X, A, S_d) - H(S_e|X, A, S_d, \hat{X}) \\
&\overset{(*)}{=} H(X|Q) - H(X|A, \hat{X}, S_d) - H(S_d|A) + H(S_d|X, A, Q) + H(S_e|X, A, S_d, Q) - H(S_e|X, A, S_d, \hat{X}) \\
&\leq H(X|Q) - H(X|A, \hat{X}, S_d, Q) - H(S_d|A, Q) + H(S_d|X, A, Q) + H(S_e|X, A, S_d, Q) - H(S_e|X, A, S_d, \hat{X}, Q) \\
&= I(X; A, \hat{X}, S_d|Q) - I(X; S_d|A, Q) + I(\hat{X}; S_e|X, A, S_d, Q) \\
&= I(X; A|Q) + I(\hat{X}; X, S_e|A, S_d, Q) \\
&= \lambda[I^{(1)}(X; A) + I^{(1)}(\hat{X}; X, S_e|A, S_d)] + (1 - \lambda)[I^{(2)}(X; A) + I^{(2)}(\hat{X}; X, S_e|A, S_d)],
\end{aligned}$$

where $(*)$ follows from $X \perp Q$ and the Markov chain $(S_e, S_d) - (X, A) - Q$.
Consider also the average distortion and cost (averaged over $Q$),

$$D = E[d(X, \hat{X})] = \lambda E^{(1)}[d(X, \hat{X})] + (1 - \lambda) E^{(2)}[d(X, \hat{X})] = \lambda D_1 + (1 - \lambda) D_2$$
$$\text{and} \quad C = E[\Lambda(A)] = \lambda E^{(1)}[\Lambda(A)] + (1 - \lambda) E^{(2)}[\Lambda(A)] = \lambda C_1 + (1 - \lambda) C_2.$$

Then, by the definition of the rate-distortion-cost function $R_{ac,cr}(D, C)$, it follows that

$$\begin{aligned}
R_{ac,cr}(\lambda D_1 + (1 - \lambda) D_2, \lambda C_1 + (1 - \lambda) C_2) &= R_{ac,cr}(D, C) \\
&\le I(X; A) + I(\hat{X}; X, S_e | A, S_d) \\
&\le \lambda \big[I^{(1)}(X; A) + I^{(1)}(\hat{X}; X, S_e | A, S_d)\big] + (1 - \lambda) \big[I^{(2)}(X; A) + I^{(2)}(\hat{X}; X, S_e | A, S_d)\big] \\
&= \lambda R_{ac,cr}(D_1, C_1) + (1 - \lambda) R_{ac,cr}(D_2, C_2).
\end{aligned}$$

Thus, we have shown that $R_{ac,cr}(D, C)$ is a non-increasing convex function of $D$ and $C$. ■

Appendix B 
Proof of Theorem 1 

A. Achievability Proof of Theorem 1 

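The proof follows from a standard random coding argument where we use the definitions and properties of $\epsilon$-typicality as in [28], i.e., the set of $\epsilon$-typical sequences for $\epsilon > 0$ with respect to $P_X(\cdot)$ is denoted by

$$\mathcal{T}_{\epsilon}^{(n)}(X) = \Big\{x^n \in \mathcal{X}^n : \Big|\tfrac{1}{n} N(a|x^n) - P_X(a)\Big| \le \epsilon P_X(a), \text{ for all } a \in \mathcal{X}\Big\}, \quad (19)$$

where $N(a|x^n)$ is the number of occurrences of $a$ in the sequence $x^n$.

Codebook Generation: Fix $P_{A|X}$ and $P_{\hat{X}|X,S_e,A}$. Let $\mathcal{W}_1^{(n)} = \{1, 2, \ldots, |\mathcal{W}_1^{(n)}|\}$, $\mathcal{W}_2^{(n)} = \{1, 2, \ldots, |\mathcal{W}_2^{(n)}|\}$, and $\mathcal{V}^{(n)} = \{1, 2, \ldots, |\mathcal{V}^{(n)}|\}$. For all $w_1 \in \mathcal{W}_1^{(n)}$, the action codewords $a^n(w_1)$ are generated i.i.d. each according to $\prod_{i=1}^n P_A(a_i)$, and for each $w_1 \in \mathcal{W}_1^{(n)}$, $|\mathcal{W}_2^{(n)}||\mathcal{V}^{(n)}|$ codewords $\{\hat{x}^n(w_1, w_2, v)\}_{w_2 \in \mathcal{W}_2^{(n)}, v \in \mathcal{V}^{(n)}}$ are generated i.i.d. each according to $\prod_{i=1}^n P_{\hat{X}|A}(\hat{x}_i | a_i(w_1))$. The codebooks are then revealed to the encoder, the action decoder, and the decoder. Let $0 < \epsilon_0 < \epsilon_1 < \epsilon < 1$.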
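As an illustration of the typicality condition in (19), the following minimal sketch tests whether a finite sequence is $\epsilon$-typical with respect to a given pmf. The function name `is_typical` and the dictionary representation of $P_X$ are assumptions made here for illustration only; they are not part of the coding scheme itself.

```python
from collections import Counter


def is_typical(seq, pmf, eps):
    """Check the eps-typicality condition of (19) for a sequence.

    seq : sequence of symbols x^n
    pmf : dict mapping each symbol a to P_X(a)  (hypothetical representation)
    eps : typicality parameter, eps > 0
    """
    n = len(seq)
    counts = Counter(seq)  # counts[a] = N(a | x^n)
    # A symbol with P_X(a) = 0 that occurs in the sequence violates (19),
    # because the right-hand side eps * P_X(a) is zero.
    if any(pmf.get(sym, 0.0) == 0.0 for sym in counts):
        return False
    # |N(a|x^n)/n - P_X(a)| <= eps * P_X(a) for every symbol a.
    return all(abs(counts[a] / n - p) <= eps * p for a, p in pmf.items())


uniform_bit = {0: 0.5, 1: 0.5}
balanced = [0, 1] * 5            # empirical frequencies match P_X exactly
skewed = [0] * 8 + [1] * 2       # empirical frequency of 0 is 0.8
```

With these inputs, `balanced` passes the test for any $\epsilon > 0$, while `skewed` passes only when $\epsilon$ is large enough to absorb the deviation of 0.3 from $P_X(0) = 0.5$.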

Encoding: Given a source realization $x^n$, the encoder first looks for the smallest $w_1 \in \mathcal{W}_1^{(n)}$ such that $a^n(w_1)$ is jointly typical with $x^n$. Then the channel states are generated as outputs of the memoryless channel with transition probability $P_{S_e^n,S_d^n|X^n,A^n}(s_e^n, s_d^n | x^n, a^n) = \prod_{i=1}^n P_{S_e,S_d|X,A}(s_{e,i}, s_{d,i} | x_i, a_i)$, and the encoder in the second stage looks for the smallest $w_2 \in \mathcal{W}_2^{(n)}$ and $v \in \mathcal{V}^{(n)}$ such that $(x^n, \hat{x}^n(w_1, w_2, v), s_e^n, a^n(w_1)) \in \mathcal{T}_{\epsilon_1}^{(n)}(X, \hat{X}, S_e, A)$. If successful, the encoder produces $\hat{x}^n(w_1, w_2, v)$ as a common reconstruction at the encoder and transmits indices $(w_1, w_2)$ to the decoder. If not successful, the encoder transmits $w_1 = 1$, $w_2 = 1$ and produces $\hat{x}^n(1, 1, 1)$.

Decoding: Given the indices $w_1$ and $w_2$, and the side information $s_d^n$, the decoder reconstructs $\hat{x}^n = \hat{x}^n(w_1, w_2, \tilde{v})$ if there exists a unique $\tilde{v} \in \mathcal{V}^{(n)}$ such that $(s_d^n, \hat{x}^n(w_1, w_2, \tilde{v}), a^n(w_1)) \in \mathcal{T}_{\epsilon}^{(n)}(S_d, \hat{X}, A)$. Otherwise, the decoder puts out $\hat{x}^n = \hat{x}^n(w_1, w_2, 1)$.

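Analysis of Probability of Error: Let $(W_1, W_2, V)$ denote the corresponding indices of the chosen codewords $A^n$ and $\hat{X}^n$ at the encoder. We define the "error" events as follows.

$$\begin{aligned}
\mathcal{E}_{1a} &= \{(X^n, A^n(w_1)) \notin \mathcal{T}_{\epsilon_1}^{(n)}(X, A) \text{ for all } w_1 \in \mathcal{W}_1^{(n)}\} \\
\mathcal{E}_2 &= \{(X^n, \hat{X}^n(W_1, w_2, v), S_e^n, A^n(W_1)) \notin \mathcal{T}_{\epsilon_1}^{(n)}(X, \hat{X}, S_e, A) \text{ for all } (w_2, v) \in \mathcal{W}_2^{(n)} \times \mathcal{V}^{(n)}\} \\
\mathcal{E}_3 &= \{(S_d^n, \hat{X}^n(W_1, W_2, V), A^n(W_1)) \notin \mathcal{T}_{\epsilon}^{(n)}(S_d, \hat{X}, A)\} \\
\mathcal{E}_4 &= \{(S_d^n, \hat{X}^n(W_1, W_2, \tilde{v}), A^n(W_1)) \in \mathcal{T}_{\epsilon}^{(n)}(S_d, \hat{X}, A) \text{ for some } \tilde{v} \in \mathcal{V}^{(n)}, \tilde{v} \ne V\}.
\end{aligned}$$

The total "error" probability is bounded by

$$\Pr(\mathcal{E}) \le \Pr(\mathcal{E}_0) + \Pr(\mathcal{E}_{1a} \cap \mathcal{E}_0^c) + \Pr(\mathcal{E}_{1b} \cap \mathcal{E}_{1a}^c) + \Pr(\mathcal{E}_2 \cap \mathcal{E}_{1b}^c) + \Pr(\mathcal{E}_3 \cap \mathcal{E}_2^c) + \Pr(\mathcal{E}_4),$$

where $\mathcal{E}_i^c$ denotes the complement of the event $\mathcal{E}_i$.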

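0) By the law of large numbers (LLN), $\Pr(X^n \in \mathcal{T}_{\epsilon_0}^{(n)}(X)) \ge 1 - \delta_{\epsilon_0}$. Since $\delta_{\epsilon_0}$ can be made arbitrarily small with increasing $n$ if $\epsilon_0 > 0$, we have $\Pr(\mathcal{E}_0) \to 0$ as $n \to \infty$.

1a) By the covering lemma [28], $\Pr(\mathcal{E}_{1a} \cap \mathcal{E}_0^c) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{W}_1^{(n)}| > I(X; A) + \delta_{\epsilon_1}$.

1b) By the conditional typicality lemma [28] where $(S_d^n, S_e^n)$ is i.i.d. according to $\prod_{i=1}^n P_{S_d,S_e|X,A}(s_{d,i}, s_{e,i} | x_i, a_i(w_1))$, we have $\Pr(\mathcal{E}_{1b} \cap \mathcal{E}_{1a}^c) \to 0$ as $n \to \infty$.

2) Averaging over all $W_1 = w_1$, by the covering lemma, where each $\hat{X}^n$ is drawn independently according to $\prod_{i=1}^n P_{\hat{X}|A}(\hat{x}_i | a_i)$, we have that $\Pr(\mathcal{E}_2 \cap \mathcal{E}_{1b}^c) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{W}_2^{(n)}| + \frac{1}{n} \log |\mathcal{V}^{(n)}| > I(X, S_e; \hat{X} | A) + \delta_{\epsilon_1}$.

3) Consider the event $\mathcal{E}_2^c$ in which there exists $(W_1, W_2, V)$ such that $(X^n, \hat{X}^n(W_1, W_2, V), A^n(W_1), S_e^n) \in \mathcal{T}_{\epsilon_1}^{(n)}(X, \hat{X}, A, S_e)$. Since we have the Markov chain $\hat{X} - (X, S_e, A) - S_d$, and $S_d^n$ is distributed according to $\prod_{i=1}^n P_{S_d|X,S_e,A}(s_{d,i} | x_i, s_{e,i}, a_i)$, by using the conditional typicality lemma, we have

$$\Pr\big((X^n, \hat{X}^n(W_1, W_2, V), A^n(W_1), S_e^n, S_d^n) \in \mathcal{T}_{\epsilon}^{(n)}(X, \hat{X}, A, S_e, S_d)\big) \to 1 \text{ as } n \to \infty.$$

This implies that $\Pr(\mathcal{E}_3 \cap \mathcal{E}_2^c) \to 0$ as $n \to \infty$.

4) Averaging over all $W_1 = w_1$, $W_2 = w_2$, and $V = v$ [28, Ch. 12, Lemma 1], by the packing lemma [28] where each $\hat{X}^n$ is drawn independently according to $\prod_{i=1}^n P_{\hat{X}|A}(\hat{x}_i | a_i)$, we have that $\Pr(\mathcal{E}_4) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{V}^{(n)}| < I(\hat{X}; S_d | A) - \delta_{\epsilon}$.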

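Finally, we consider the case where there is no error, i.e.,

$$(X^n, \hat{X}^n(W_1, W_2, V), A^n(W_1), S_e^n, S_d^n) \in \mathcal{T}_{\epsilon}^{(n)}(X, \hat{X}, A, S_e, S_d).$$

By the law of total expectation, the averaged distortion (over all codebooks $\mathcal{C}$ containing codewords $(\hat{X}^n, A^n)$) is given by

$$\begin{aligned}
E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n)] &= \Pr(\mathcal{E}) \cdot E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n) | \mathcal{E}] + \Pr(\mathcal{E}^c) \cdot E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n) | \mathcal{E}^c] \\
&\le \Pr(\mathcal{E}) \cdot d_{max} + \Pr(\mathcal{E}^c) \cdot E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n) | \mathcal{E}^c],
\end{aligned}$$

where $d_{max}$ is assumed to be the maximal average distortion incurred by the "error" events.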
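Given $\mathcal{E}^c$, the distortion is bounded by

$$\begin{aligned}
d^{(n)}(x^n, \hat{x}^n) &= \frac{1}{n} \sum_{i=1}^n d(x_i, \hat{x}_i) \\
&= \frac{1}{n} \sum_{a,b} N(a, b | x^n, \hat{x}^n) \, d(a, b) \\
&\overset{(*)}{\le} \sum_{a,b} P_{X,\hat{X}}(a, b)(1 + \epsilon) \, d(a, b) = E[d(X, \hat{X})](1 + \epsilon),
\end{aligned}$$

where $(*)$ follows from the definition in (19).
Therefore, we have

$$E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n)] \le \Pr(\mathcal{E}) \cdot d_{max} + \Pr(\mathcal{E}^c) \cdot E[d(X, \hat{X})](1 + \epsilon).$$

Similarly, we have for the average cost

$$E_{\mathcal{C}}[\Lambda^{(n)}(A^n)] \le \Pr(\mathcal{E}) \cdot c_{max} + \Pr(\mathcal{E}^c) \cdot E[\Lambda(A)](1 + \epsilon),$$

where $c_{max}$ is assumed to be the maximal average cost incurred by the "error" events.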
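The bound $E[d(X, \hat{X})](1 + \epsilon)$ on the per-letter distortion of a jointly typical pair can be checked numerically. The following is a minimal sketch under assumed toy data: the binary joint pmf, the Hamming distortion, and all function names are hypothetical choices for illustration and do not come from the coding scheme.

```python
from collections import Counter


def is_jointly_typical(pairs, joint_pmf, eps):
    """Typicality test of (19) applied to the pair sequence (x_i, xhat_i)."""
    n = len(pairs)
    counts = Counter(pairs)  # counts[(a, b)] = N(a, b | x^n, xhat^n)
    if any(joint_pmf.get(ab, 0.0) == 0.0 for ab in counts):
        return False
    return all(abs(counts[ab] / n - p) <= eps * p for ab, p in joint_pmf.items())


def average_distortion(pairs, d):
    """Per-letter distortion (1/n) * sum_i d(x_i, xhat_i)."""
    return sum(d(a, b) for a, b in pairs) / len(pairs)


def expected_distortion(joint_pmf, d):
    """E[d(X, Xhat)] under the joint pmf."""
    return sum(p * d(a, b) for (a, b), p in joint_pmf.items())


def hamming(a, b):
    return 0.0 if a == b else 1.0


# Hypothetical toy example: binary source and reconstruction, n = 100.
joint_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pairs = [(0, 0)] * 39 + [(0, 1)] * 11 + [(1, 0)] * 10 + [(1, 1)] * 40
eps = 0.2

typical = is_jointly_typical(pairs, joint_pmf, eps)
lhs = average_distortion(pairs, hamming)                     # empirical distortion
rhs = (1 + eps) * expected_distortion(joint_pmf, hamming)    # bound from (19)
```

Whenever `typical` is true, the inequality `lhs <= rhs` must hold, since each empirical frequency exceeds the corresponding probability by at most a factor of $1 + \epsilon$.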

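By combining the bounds on the code rates that make $\Pr(\mathcal{E}) \to 0$ as $n \to \infty$ and considering the constraint $\frac{1}{n} \log |\mathcal{W}^{(n)}| = \frac{1}{n} \log |\mathcal{W}_1^{(n)}||\mathcal{W}_2^{(n)}| \le R + \delta$, for any $\delta > 0$, we have

$$R + \delta \ge \frac{1}{n} \log |\mathcal{W}^{(n)}| > I(X; A) + I(X, S_e; \hat{X} | A) - I(\hat{X}; S_d | A) + \delta'_{\epsilon},$$

where $\delta'_{\epsilon}$ can be made arbitrarily small, i.e., $\delta'_{\epsilon} \to 0$ as $\epsilon \to 0$.

Thus, for any $\delta > 0$, if $R > I(X; A) + I(X, S_e; \hat{X} | A) - I(\hat{X}; S_d | A)$, $E[d(X, \hat{X})] \le D$ and $E[\Lambda(A)] \le C$, then we have $\Pr(\mathcal{E}) \to 0$ as $n \to \infty$, and for all sufficiently large $n$,

$$E_{\mathcal{C},X^n}[d^{(n)}(X^n, \hat{X}^n)] \le D + \delta_{\epsilon} \le D + \delta,$$
$$E_{\mathcal{C}}[\Lambda^{(n)}(A^n)] \le C + \delta_{\epsilon} \le C + \delta.$$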

Lastly, with $\Pr(\mathcal{E}) \to 0$ as $n \to \infty$, it follows that with high probability the decoded codeword $\hat{X}^n = \hat{x}^n(W_1, W_2, \tilde{v})$ at the decoder is the correct one which was chosen at the encoder. We recall the encoding process which determines the codeword $\hat{X}^n$ based on $X^n$, $S_e^n$, and $A^n$, i.e., there exists a mapping $\psi^{(n)}(\cdot)$ such that $\hat{X}^n = \psi^{(n)}(X^n, S_e^n, A^n)$. Thus, for any $\delta > 0$, we can have $\Pr(\psi^{(n)}(X^n, S_e^n, A^n) \ne g^{(n)}(W_1, W_2, S_d^n)) \le \delta$ for all sufficiently large $n$.

The average distortion, cost, and common reconstruction error probability (over all codebooks) are upper-bounded by $D + \delta$, $C + \delta$, and $\delta$, respectively. Therefore, there must exist at least one code such that, for sufficiently large $n$, the average distortion, cost, and common reconstruction error probability are upper-bounded by $D + \delta$, $C + \delta$, and $\delta$.

Thus, any $(R, D, C)$ such that we have $R > I(X; A) + I(X, S_e; \hat{X} | A) - I(\hat{X}; S_d | A)$, $E[d(X, \hat{X})] \le D$, and $E[\Lambda(A)] \le C$ for some $P_X(x) P_{A|X}(a|x) P_{S_e,S_d|X,A}(s_e, s_d | x, a) P_{\hat{X}|X,S_e,A}(\hat{x} | x, s_e, a)$ is achievable. This concludes the achievability proof. ■

B. Converse Proof of Theorem 1

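Let us assume the existence of a specific sequence of $(|\mathcal{W}^{(n)}|, n)$ codes such that for $\delta_n > 0$, $\frac{1}{n} \log |\mathcal{W}_1^{(n)}||\mathcal{W}_2^{(n)}| \le R + \delta_n$, $\frac{1}{n} E[\sum_{i=1}^n d(X_i, g_i)] \le D + \delta_n$, $\frac{1}{n} E[\sum_{i=1}^n \Lambda(A_i)] \le C + \delta_n$, and $\Pr(\psi^{(n)}(X^n, S_e^n, A^n) \ne g^{(n)}(W_1, W_2, S_d^n)) \le \delta_n$, where $g_i$ denotes the $i$-th symbol of $g^{(n)}(W_1, W_2, S_d^n)$ and $\lim_{n \to \infty} \delta_n = 0$. Then we will show that $R \ge R_{ac,cr}(D, C)$, where $R_{ac,cr}(D, C)$ is the rate-distortion-cost function defined as

$$R_{ac,cr}(D, C) = \min \big[I(X; A) + I(\hat{X}; X, S_e | A, S_d)\big]. \quad (20)$$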

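With $\Pr(\psi^{(n)}(X^n, S_e^n, A^n) \ne g^{(n)}(W_1, W_2, S_d^n)) = \delta'_n \le \delta_n$, and $|\hat{\mathcal{X}}| = |\mathcal{X}|$, the Fano inequality can be applied to bound

$$H\big(\psi^{(n)}(X^n, S_e^n, A^n) \,\big|\, g^{(n)}(W_1, W_2, S_d^n)\big) \le h(\delta'_n) + \delta'_n \log(|\hat{\mathcal{X}}|^n - 1) = n\epsilon_n, \quad (21)$$

where $h(\delta'_n)$ is the binary entropy function, and $\epsilon_n \to 0$ as $\delta'_n \to 0$.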
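To make the behavior of the Fano term in (21) concrete, the following minimal sketch evaluates the per-symbol quantity $\epsilon_n = \frac{1}{n}[h(\delta'_n) + \delta'_n \log(|\hat{\mathcal{X}}|^n - 1)]$ in bits. The function names and the numerical values are illustrative assumptions only.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_epsilon(delta, alphabet_size, n):
    """Per-symbol Fano term eps_n = [h(delta) + delta * log2(|X|^n - 1)] / n.

    alphabet_size ** n is an exact Python integer, and math.log2 accepts
    arbitrarily large integers, so no floating-point overflow occurs.
    """
    log_term = math.log2(alphabet_size ** n - 1)
    return (binary_entropy(delta) + delta * log_term) / n


# Hypothetical values: quaternary reconstruction alphabet, block length 1000.
eps_small = fano_epsilon(0.001, 4, 1000)
eps_large = fano_epsilon(0.1, 4, 1000)
```

Since $\log(|\hat{\mathcal{X}}|^n - 1) < n \log |\hat{\mathcal{X}}|$, the term satisfies $\epsilon_n < h(\delta'_n)/n + \delta'_n \log |\hat{\mathcal{X}}|$, which vanishes as $\delta'_n \to 0$.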
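Then the standard properties of the entropy function give

$$\begin{aligned}
n(R + \delta_n) &\ge \log(|\mathcal{W}_1^{(n)}| \cdot |\mathcal{W}_2^{(n)}|) \ge H(W_1, W_2) \\
&\overset{(*)}{=} H(W_1, W_2, A^n) = H(A^n) + H(W_1, W_2 | A^n) \\
&\ge [H(A^n) - H(A^n | X^n, S_e^n)] + [H(W_1, W_2 | A^n, S_d^n) - H(W_1, W_2 | A^n, X^n, S_e^n, S_d^n)] \\
&= \underbrace{H(X^n, S_e^n) - H(X^n, S_e^n | A^n) + H(X^n, S_e^n | A^n, S_d^n)}_{=P} \underbrace{- H(X^n, S_e^n | A^n, S_d^n, W_1, W_2)}_{=Q}, \quad (22)
\end{aligned}$$

where in $(*)$ we used the fact that $A^n = g_a^{(n)}(W_1)$, and $g_a^{(n)}(\cdot)$ is a deterministic function. Further,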

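$$\begin{aligned}
P &= H(X^n, S_e^n) + H(S_d^n | X^n, S_e^n, A^n) - H(S_d^n | A^n) \\
&= H(X^n) + H(S_e^n | X^n) + H(S_e^n, S_d^n | X^n, A^n) - H(S_e^n | X^n, A^n) - H(S_d^n | A^n) \\
&\overset{(*)}{\ge} \sum_{i=1}^n H(X_i) + H(S_{e,i}, S_{d,i} | X_i, A_i) - H(S_{d,i} | A_i) \\
&= \sum_{i=1}^n H(X_i) + H(S_{e,i} | X_i, A_i) + H(S_{d,i} | X_i, S_{e,i}, A_i) - H(S_{d,i} | A_i) \\
&= \sum_{i=1}^n H(X_i) + H(S_{e,i} | X_i, A_i) - H(X_i, S_{e,i} | A_i) + H(X_i, S_{e,i} | A_i, S_{d,i}) \\
&= \sum_{i=1}^n I(X_i; A_i) + H(X_i, S_{e,i} | A_i, S_{d,i}), \quad (23)
\end{aligned}$$

where $(*)$ holds due to the i.i.d. property of $P_{X^n}$ and $P_{S_e^n,S_d^n|X^n,A^n}$, together with the fact that conditioning reduces entropy.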
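$$\begin{aligned}
Q &= -H(X^n, S_e^n | A^n, S_d^n, W_1, W_2, g^{(n)}(W_1, W_2, S_d^n)) \\
&\ge -H(X^n, S_e^n | A^n, S_d^n, g^{(n)}(W_1, W_2, S_d^n)) + H(\psi^{(n)}(X^n, S_e^n, A^n) | A^n, S_d^n, g^{(n)}(W_1, W_2, S_d^n), X^n, S_e^n) \\
&\overset{(a)}{\ge} -n\epsilon_n - H(X^n, S_e^n | A^n, S_d^n, \psi^{(n)}(X^n, S_e^n, A^n)) \\
&\overset{(b)}{\ge} -n\epsilon_n - \sum_{i=1}^n H(X_i, S_{e,i} | A_i, S_{d,i}, \psi_i^{(n)}(X^n, S_e^n, A^n)), \quad (24)
\end{aligned}$$

where $(a)$ follows from (21) and $\psi_i^{(n)}(X^n, S_e^n, A^n)$ in $(b)$ corresponds to the $i$-th symbol of $\psi^{(n)}(X^n, S_e^n, A^n)$.

Define $\hat{X}_i = \psi_i^{(n)}(X^n, S_e^n, A^n)$. Then the Markov chain $\hat{X}^n - (X^n, S_e^n, A^n) - S_d^n$ holds. Together with the memoryless property, $P_{S_e^n,S_d^n|X^n,A^n}(s_e^n, s_d^n | x^n, a^n) = \prod_{i=1}^n P_{S_e,S_d|X,A}(s_{e,i}, s_{d,i} | x_i, a_i)$, we also have that $\hat{X}_i - (X_i, S_{e,i}, A_i) - S_{d,i}$ forms a Markov chain. Combining (22)-(24), we have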

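$$\begin{aligned}
R + \delta_n &\ge \frac{1}{n} \log(|\mathcal{W}_1^{(n)}| \cdot |\mathcal{W}_2^{(n)}|) \\
&\ge \frac{1}{n} \sum_{i=1}^n I(X_i; A_i) + I(\hat{X}_i; X_i, S_{e,i} | A_i, S_{d,i}) - \epsilon_n \\
&\overset{(a)}{\ge} \frac{1}{n} \sum_{i=1}^n R_{ac,cr}\big(E[d(X_i, \hat{X}_i)], E[\Lambda(A_i)]\big) - \epsilon_n \\
&\overset{(b)}{\ge} R_{ac,cr}\Big(\frac{1}{n} \sum_{i=1}^n E[d(X_i, \hat{X}_i)], \frac{1}{n} \sum_{i=1}^n E[\Lambda(A_i)]\Big) - \epsilon_n, \quad (25)
\end{aligned}$$

where $(a)$ follows from the definition of $R_{ac,cr}(D, C)$ in (20), and the fact that $\hat{X}_i - (X_i, S_{e,i}, A_i) - S_{d,i}$ forms a Markov chain, and $(b)$ follows from Jensen's inequality and convexity of $R_{ac,cr}(D, C)$.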

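Let $\beta$ be the event that the reconstruction at the encoder is not equal to that at the decoder, i.e., $\beta = \{\psi^{(n)}(X^n, S_e^n, A^n) \ne g^{(n)}(W_1, W_2, S_d^n)\}$. It then follows that

$$E[d(X_i, g_i)] = E[d(X_i, g_i) | \beta^c] \cdot \Pr(\beta^c) + E[d(X_i, g_i) | \beta] \cdot \Pr(\beta) \overset{(*)}{\ge} E[d(X_i, \hat{X}_i) | \beta^c] \cdot \Pr(\beta^c), \quad (26)$$

where $(*)$ holds because we have $g_i = \hat{X}_i$ for given $\beta^c$. Thus

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^n E[d(X_i, \hat{X}_i)] &= \frac{1}{n} \sum_{i=1}^n E[d(X_i, \hat{X}_i) | \beta^c] \cdot \Pr(\beta^c) + E[d(X_i, \hat{X}_i) | \beta] \cdot \Pr(\beta) \\
&\overset{(a)}{\le} \frac{1}{n} \sum_{i=1}^n E[d(X_i, \hat{X}_i) | \beta^c] \cdot \Pr(\beta^c) + d_{max} \delta_n \\
&\overset{(b)}{\le} \frac{1}{n} \sum_{i=1}^n E[d(X_i, g_i)] + d_{max} \delta_n \\
&\overset{(c)}{\le} D + (1 + d_{max}) \delta_n, \quad (27)
\end{aligned}$$

where $(a)$ follows from the assumption that $d_{max}$ is the maximum average distortion incurred by the error event $\beta$ and that $\Pr(\psi^{(n)}(X^n, S_e^n, A^n) \ne g^{(n)}(W_1, W_2, S_d^n)) \le \delta_n$, $(b)$ follows from (26), and $(c)$ follows from the assumption that $\frac{1}{n} E[\sum_{i=1}^n d(X_i, g_i)] \le D + \delta_n$ in the beginning.

Finally, we substitute (27) into (25). With $\lim_{n \to \infty} \delta_n = 0$ and $\lim_{n \to \infty} \epsilon_n = 0$, we thus get $R \ge R_{ac,cr}(D, C)$ by using the assumption that $\frac{1}{n} E[\sum_{i=1}^n \Lambda(A_i)] \le C + \delta_n$ and the non-increasing property of $R_{ac,cr}(D, C)$. This concludes the proof of the converse. ■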

Appendix C 
Converse Proof of Proposition 1 

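Let $(W_1, W_2) \in \{1, 2, \ldots, |\mathcal{W}_1^{(n)}|\} \times \{1, 2, \ldots, |\mathcal{W}_2^{(n)}|\}$ denote the encoded version of $X^n$, where $|\mathcal{W}^{(n)}| = |\mathcal{W}_1^{(n)}| \cdot |\mathcal{W}_2^{(n)}|$ and $W = (W_1, W_2)$. Let us assume the existence of a specific sequence of $(|\mathcal{W}^{(n)}|, n)$ codes such that for $\delta_n > 0$, $\frac{1}{n} \log |\mathcal{W}^{(n)}| \le R + \delta_n$, $\frac{1}{n} E[\sum_{i=1}^n d(X_i, \hat{X}_i)] \le D + \delta_n$, $\frac{1}{n} E[\sum_{i=1}^n \Lambda(A_i)] \le C + \delta_n$, where $\hat{X}_i$ denotes the $i$-th symbol of $\hat{X}^n = g^{(n)}(W, S_d^n)$ and $\lim_{n \to \infty} \delta_n = 0$. Then we identify $U$ and $g : \mathcal{U} \times \mathcal{S}_d \to \hat{\mathcal{X}}$ and show that $R \ge R_{ac}(D, C)$, where $R_{ac}(D, C)$ is the rate-distortion-cost function defined as

$$R_{ac}(D, C) = \min \big[I(X; A) + I(U; X, S_e | A, S_d)\big]. \quad (28)$$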

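We start bounding the rate as in (22),

$$\begin{aligned}
n(R + \delta_n) &\ge \log(|\mathcal{W}^{(n)}|) \ge H(W) \\
&\ge \underbrace{H(X^n, S_e^n) - H(X^n, S_e^n | A^n) + H(X^n, S_e^n | A^n, S_d^n)}_{=P} \underbrace{- H(X^n, S_e^n | A^n, S_d^n, W)}_{=Q}. \quad (29)
\end{aligned}$$

The term $P$ is given as in (23),

$$P \ge \sum_{i=1}^n I(X_i; A_i) + H(X_i, S_{e,i} | A_i, S_{d,i}) \quad (30)$$

and

$$Q \ge \sum_{i=1}^n -H(X_i | A^n, S_d^n, W, X^{i-1}) - H(S_{e,i} | A^n, S_d^n, W, X^{i-1}, X_i) = -\sum_{i=1}^n H(X_i, S_{e,i} | U_i, A_i, S_{d,i}), \quad (31)$$

where $U_i = (A^{n \setminus i}, S_d^{n \setminus i}, W, X^{i-1})$, $i = 1, 2, \ldots, n$.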
Combining (29)-(31), and letting $n \to \infty$, we have

$$\begin{aligned}
nR &\ge \sum_{i=1}^n I(X_i; A_i) + I(U_i; X_i, S_{e,i} | A_i, S_{d,i}) \\
&\overset{(a)}{\ge} \sum_{i=1}^n R_{ac}\big(E[d(X_i, g_i(U_i, S_{d,i}))], E[\Lambda(A_i)]\big) \\
&\overset{(b)}{\ge} n R_{ac}\Big(\frac{1}{n} \sum_{i=1}^n E[d(X_i, g_i(U_i, S_{d,i}))], \frac{1}{n} \sum_{i=1}^n E[\Lambda(A_i)]\Big) \\
&\overset{(c)}{\ge} n R_{ac}(D, C),
\end{aligned}$$

where $(a)$ follows from the definition of the rate-distortion-cost function in (28), the fact that $U_i - (A_i, X_i, S_{e,i}) - S_{d,i}$ forms a Markov chain, and that $\hat{X}_i = g_i^{(n)}(W, S_d^n) = g_i(U_i, S_{d,i})$ for some $g_i$; $(b)$ follows from Jensen's inequality and convexity of $R_{ac}(D, C)$, which can be proved similarly as in [7] or Lemma 1; and $(c)$ follows from the non-increasing property of $R_{ac}(D, C)$, $\frac{1}{n} E[\sum_{i=1}^n d(X_i, \hat{X}_i)] \le D + \delta_n$, and $\frac{1}{n} E[\sum_{i=1}^n \Lambda(A_i)] \le C + \delta_n$.

For the bound on the cardinality of the set $\mathcal{U}$, it can be shown by using the support lemma [29] that $\mathcal{U}$ should have $|\mathcal{A}||\mathcal{X}| - 1$ elements to preserve $P_{A,X}$, plus four more for $I(U; X, S_e | A)$, $I(U; S_d | A)$, the distortion, and the cost constraints. This finally concludes the proof. ■

Appendix D 
Proof of Theorem 2 

A. Achievability Proof of Theorem 2 

Similarly to the previous achievability proof in Theorem 1, the proof follows from a standard random coding argument where we use the definition and properties of $\epsilon$-typicality as in [28]. We use the technique of rate splitting, i.e., the message $M$ of rate $R$ is split into two messages $M_1$ and $M_2$ of rates $R_1$ and $R_2$. Two-stage coding is then considered, i.e., a first stage for communicating the identity of the action sequence, and a second stage for communicating the identity of $X^n$ based on the known action sequence.

For given channels with transition probabilities $P_{S_e,S_d|A}(s_e, s_d | a)$ and $P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$ we can assign the joint probability to any random vector $(A, X, S_e)$ by

$$P_{A,S_e,S_d,X,Y}(a, s_e, s_d, x, y) = P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d).$$

Codebook Generation: Fix $P_A$ and $P_{X|A,S_e}$. Let $\mathcal{M}_1^{(n)} = \{1, 2, \ldots, |\mathcal{M}_1^{(n)}|\}$, $\mathcal{M}_2^{(n)} = \{1, 2, \ldots, |\mathcal{M}_2^{(n)}|\}$, and $\mathcal{J}^{(n)} = \{1, 2, \ldots, |\mathcal{J}^{(n)}|\}$. For all $m_1 \in \mathcal{M}_1^{(n)}$, randomly generate $a^n(m_1)$ i.i.d. according to $\prod_{i=1}^n P_A(a_i)$. For each $m_1 \in \mathcal{M}_1^{(n)}$, generate $|\mathcal{M}_2^{(n)}| \cdot |\mathcal{J}^{(n)}|$ codewords $\{x^n(m_1, m_2, j)\}_{m_2 \in \mathcal{M}_2^{(n)}, j \in \mathcal{J}^{(n)}}$ i.i.d. each according to $\prod_{i=1}^n P_{X|A}(x_i | a_i(m_1))$. Then the codebooks are revealed to the action encoder, the channel encoder, and the decoder. Let $0 < \epsilon_0 < \epsilon_1 < \epsilon < 1$.

Encoding: Given the message $m = (m_1, m_2) \in \mathcal{M}^{(n)}$, the action codeword $a^n(m_1)$ is chosen and the channel state information $(s_e^n, s_d^n)$ is generated as an output of the memoryless channel $P_{S_e^n,S_d^n|A^n}(s_e^n, s_d^n | a^n) = \prod_{i=1}^n P_{S_e,S_d|A}(s_{e,i}, s_{d,i} | a_i)$. The encoder looks for the smallest value of $j \in \mathcal{J}^{(n)}$ such that $(s_e^n, a^n(m_1), x^n(m_1, m_2, j)) \in \mathcal{T}_{\epsilon_1}^{(n)}(S_e, A, X)$. If no such $j$ exists, set $j = 1$. The channel input sequence is then chosen to be $x^n(m_1, m_2, j)$.

Decoding: Upon receiving $y^n$ and $s_d^n$, the decoder in the first step looks for the smallest $\tilde{m}_1 \in \mathcal{M}_1^{(n)}$ such that $(y^n, s_d^n, a^n(\tilde{m}_1)) \in \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A)$. If successful, then set $\hat{m}_1 = \tilde{m}_1$. Otherwise, set $\hat{m}_1 = 1$. Then, based on the known $a^n(\hat{m}_1)$, the decoder looks for a pair $(\tilde{m}_2, \tilde{j})$ with the smallest $\tilde{m}_2 \in \mathcal{M}_2^{(n)}$ and $\tilde{j} \in \mathcal{J}^{(n)}$ such that $(y^n, s_d^n, a^n(\hat{m}_1), x^n(\hat{m}_1, \tilde{m}_2, \tilde{j})) \in \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A, X)$. If there exists such a pair, the decoded message is set to be $\hat{m} = (\hat{m}_1, \tilde{m}_2)$, and $\hat{x}^n = x^n(\hat{m}_1, \tilde{m}_2, \tilde{j})$. Otherwise, $\hat{m} = (1, 1)$ and $\hat{x}^n = x^n(1, 1, 1)$.$^1$

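$^1$ We note that although simultaneous joint typicality decoding gives different constraints on the individual rates as compared to the sequential two-stage decoding considered in this paper, it gives the same constraints on the total transmission rate in which we are interested.

Analysis of Probability of Error: Due to the symmetry of the random code construction, the error probability does not depend on which message was sent. Assume that $M = (M_1, M_2)$ was sent and $J$ was chosen at the encoder. We define the error events as follows.

$$\begin{aligned}
\mathcal{E}_1 &= \{A^n(M_1) \notin \mathcal{T}_{\epsilon_0}^{(n)}(A)\} \\
\mathcal{E}_2 &= \{(S_e^n, S_d^n, A^n(M_1)) \notin \mathcal{T}_{\epsilon_1}^{(n)}(S_e, S_d, A)\} \\
\mathcal{E}_3 &= \{(S_e^n, A^n(M_1), X^n(M_1, M_2, j)) \notin \mathcal{T}_{\epsilon_1}^{(n)}(S_e, A, X) \text{ for all } j \in \mathcal{J}^{(n)}\} \\
\mathcal{E}_{4a} &= \{(Y^n, S_d^n, A^n(M_1)) \notin \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A)\} \\
\mathcal{E}_{4b} &= \{(Y^n, S_d^n, A^n(\tilde{m}_1)) \in \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A) \text{ for some } \tilde{m}_1 \in \mathcal{M}_1^{(n)}, \tilde{m}_1 \ne M_1\} \\
\mathcal{E}_{5a} &= \{(Y^n, S_d^n, A^n(M_1), X^n(M_1, M_2, J)) \notin \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A, X)\} \\
\mathcal{E}_{5b} &= \{(Y^n, S_d^n, A^n(M_1), X^n(M_1, \tilde{m}_2, \tilde{j})) \in \mathcal{T}_{\epsilon}^{(n)}(Y, S_d, A, X) \text{ for some } (\tilde{m}_2, \tilde{j}) \in \mathcal{M}_2^{(n)} \times \mathcal{J}^{(n)}, (\tilde{m}_2, \tilde{j}) \ne (M_2, J)\}.
\end{aligned}$$

The probability of error events can be bounded by

$$\Pr(\mathcal{E}) \le \Pr(\mathcal{E}_1) + \Pr(\mathcal{E}_2 \cap \mathcal{E}_1^c) + \Pr(\mathcal{E}_3 \cap \mathcal{E}_2^c) + \Pr(\mathcal{E}_{4a} \cap \mathcal{E}_3^c) + \Pr(\mathcal{E}_{4b}) + \Pr(\mathcal{E}_{5a} \cap \mathcal{E}_3^c) + \Pr(\mathcal{E}_{5b}),$$

where $\mathcal{E}_i^c$ denotes the complement of event $\mathcal{E}_i$.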

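1) Since $A^n(M_1)$ is i.i.d. according to $P_A$, by the LLN we have $\Pr(\mathcal{E}_1) \to 0$ as $n \to \infty$.

2) Consider the event $\mathcal{E}_1^c$ where we have $A^n(M_1) \in \mathcal{T}_{\epsilon_0}^{(n)}(A)$. Since $(S_e^n, S_d^n)$ is distributed according to $\prod_{i=1}^n P_{S_e,S_d|A}(s_{e,i}, s_{d,i} | a_i)$, by the conditional typicality lemma [28], we have that $\Pr(\mathcal{E}_2 \cap \mathcal{E}_1^c) \to 0$ as $n \to \infty$.

3) By the covering lemma [28] where $X^n$ is i.i.d. according to $\prod_{i=1}^n P_{X|A}(x_i | a_i)$, we have $\Pr(\mathcal{E}_3 \cap \mathcal{E}_2^c) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{J}^{(n)}| > I(X; S_e | A) + \delta_{\epsilon_1}$, where $\delta_{\epsilon_1} \to 0$ as $\epsilon_1 \to 0$.

4a) Consider the event $\mathcal{E}_3^c$ where we have $(S_e^n, A^n(M_1), X^n(M_1, M_2, J)) \in \mathcal{T}_{\epsilon_1}^{(n)}(S_e, A, X)$. Since $S_d - (A, S_e) - X$ forms a Markov chain and $S_d^n$ is distributed according to $\prod_{i=1}^n P_{S_d|A,S_e}(s_{d,i} | a_i, s_{e,i})$, we have by the conditional typicality lemma [28] that $\Pr((S_d^n, S_e^n, A^n(M_1), X^n(M_1, M_2, J)) \in \mathcal{T}_{\epsilon}^{(n)}(S_d, S_e, A, X)) \to 1$ as $n \to \infty$. And since we have the Markov chain $A - (X, S_e, S_d) - Y$ and $Y^n$ is distributed according to $\prod_{i=1}^n P_{Y|X,S_e,S_d}(y_i | x_i, s_{e,i}, s_{d,i})$, by using once again the conditional typicality lemma, it follows that

$$\Pr\big((Y^n, S_e^n, S_d^n, A^n(M_1), X^n(M_1, M_2, J)) \in \mathcal{T}_{\epsilon}^{(n)}(Y, A, S_e, S_d, X)\big) \to 1 \text{ as } n \to \infty.$$

This also implies that $\Pr(\mathcal{E}_{4a} \cap \mathcal{E}_3^c) \to 0$ as $n \to \infty$.

4b) By the packing lemma [28], we have $\Pr(\mathcal{E}_{4b}) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{M}_1^{(n)}| < I(A; Y, S_d) - \delta_{\epsilon}$, where $\delta_{\epsilon} \to 0$ as $\epsilon \to 0$.

5a) As in 4a), we have $\Pr(\mathcal{E}_{5a} \cap \mathcal{E}_3^c) \to 0$ as $n \to \infty$.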

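5b) Averaging over all $J = j$, by the packing lemma where $X^n$ is i.i.d. according to $\prod_{i=1}^n P_{X|A}(x_i | a_i)$, we have $\Pr(\mathcal{E}_{5b}) \to 0$ as $n \to \infty$ if $\frac{1}{n} \log |\mathcal{M}_2^{(n)}| + \frac{1}{n} \log |\mathcal{J}^{(n)}| < I(X; Y, S_d | A) - \delta_{\epsilon}$.

Finally, by combining the bounds on the code rates that make $\Pr(\mathcal{E}) \to 0$ as $n \to \infty$,

$$\begin{aligned}
\frac{1}{n} \log |\mathcal{J}^{(n)}| &> I(X; S_e | A) + \delta_{\epsilon_1}, \\
\frac{1}{n} \log |\mathcal{M}_1^{(n)}| &< I(A; Y, S_d) - \delta_{\epsilon}, \\
\frac{1}{n} \log |\mathcal{M}_2^{(n)}| + \frac{1}{n} \log |\mathcal{J}^{(n)}| &< I(X; Y, S_d | A) - \delta_{\epsilon},
\end{aligned}$$

we have

$$\begin{aligned}
\frac{1}{n} \log |\mathcal{M}^{(n)}| = \frac{1}{n} \log |\mathcal{M}_1^{(n)}||\mathcal{M}_2^{(n)}| &< I(A, X; Y, S_d) - I(X; S_e | A) - \delta'_{\epsilon}, \\
\frac{1}{n} \log |\mathcal{M}_2^{(n)}| &< I(X; Y, S_d | A) - I(X; S_e | A) - \delta''_{\epsilon},
\end{aligned}$$

where $\delta'_{\epsilon}, \delta''_{\epsilon} \to 0$ as $\epsilon \to 0$.

Since, for any $\delta > 0$, the achievable rate $R$ satisfies $\frac{1}{n} \log |\mathcal{M}^{(n)}| \ge R - \delta$, and we know that $\frac{1}{n} \log |\mathcal{M}_2^{(n)}| \ge 0$, we get

$$R - \delta \le \frac{1}{n} \log |\mathcal{M}^{(n)}| < I(A, X; Y, S_d) - I(X; S_e | A) - \delta'_{\epsilon},$$
$$\text{and} \quad 0 \le \frac{1}{n} \log |\mathcal{M}_2^{(n)}| < I(X; Y, S_d | A) - I(X; S_e | A) - \delta''_{\epsilon}. \quad (32)$$

Since $\epsilon$ can be made arbitrarily small for increasing $n$, and by a standard random coding argument, we have that any rate $R$ satisfying

$$R < I(A, X; Y, S_d) - I(X; S_e | A)$$
$$\text{and} \quad 0 < I(X; Y, S_d | A) - I(X; S_e | A),$$

for some $P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$ is achievable.

Note that the latter condition is for the two-stage coding to be successful, i.e., we can split the message into two parts with positive rates. This concludes the achievability proof. ■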
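The elimination of the auxiliary rate $\frac{1}{n} \log |\mathcal{J}^{(n)}|$ above can be summarized in a few lines. The sketch below takes the three mutual information values as given numbers (in bits) and returns the resulting bound on the total rate together with the slack in the two-stage coding condition; the function name and the numerical inputs are hypothetical and serve only to illustrate the bookkeeping, with the vanishing $\delta$ terms ignored.

```python
def two_stage_rate_bounds(i_a_ysd, i_x_ysd_given_a, i_x_se_given_a):
    """Combine the three code-rate bounds of the achievability proof.

    i_a_ysd          : I(A; Y, S_d)
    i_x_ysd_given_a  : I(X; Y, S_d | A)
    i_x_se_given_a   : I(X; S_e | A)

    Returns (total_rate_bound, second_stage_slack, feasible), where
      total_rate_bound   = I(A, X; Y, S_d) - I(X; S_e | A)
      second_stage_slack = I(X; Y, S_d | A) - I(X; S_e | A)
      feasible           = True if the two-stage coding condition holds.
    """
    # Chain rule: I(A, X; Y, S_d) = I(A; Y, S_d) + I(X; Y, S_d | A).
    i_ax_ysd = i_a_ysd + i_x_ysd_given_a
    total_rate_bound = i_ax_ysd - i_x_se_given_a
    second_stage_slack = i_x_ysd_given_a - i_x_se_given_a
    feasible = second_stage_slack >= 0.0
    return total_rate_bound, second_stage_slack, feasible


# Hypothetical values (in bits) for two different input distributions.
active_ok = two_stage_rate_bounds(0.5, 0.4, 0.1)
violated = two_stage_rate_bounds(0.5, 0.1, 0.4)
```

In the second example the total-rate expression is still positive, yet the input distribution is excluded because the second-stage slack is negative; this is the sense in which the two-stage coding condition can restrict the set of admissible input distributions.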

B. Converse Proof of Theorem 2 

We show that for any achievable rate $R$, it follows that $R \le I(A, X; Y, S_d) - I(X; S_e | A)$ and $0 \le I(X; Y, S_d | A) - I(X; S_e | A)$ for some $P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$. From the problem formulation, we can write the joint probability mass function,



-'-^f(")^^^-„n f(n)|-^ <,n-,_o.n „('')|-„n gnw^n^gO;')(y,._g,.,=^. f-r i n„ /I X 

' [[Ps^,Sa\A{Se,i,Sd,i\ai)PY\X,Se,sAyi\^i^Se,i,Sd,i), 

(33) 



|A1(")| 



where $M$ is chosen uniformly at random from the set $\mathcal{M}^{(n)} = \{1, 2, \ldots, |\mathcal{M}^{(n)}|\}$.

Lemma 2: For the joint pmf in (33), $S_{d,i} - (A_i, S_{e,i}, X_i) - (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1})$ forms a Markov chain.

Fig. 11. Graphical proof of the Markov chain $S_{d,i} - (A_i, S_{e,i}, X_i) - (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1})$ with the marginal pmf derived from the joint pmf in (33) by summing out $(Y_i^n, \hat{M}, \hat{X}^n)$.

Proof: From (33), we use the undirected graph as a tool to derive the Markov chain [30], [31]. Let

$$U = (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1}), \quad V = (A_i, S_{e,i}, X_i), \quad W = S_{d,i}.$$

We can draw the undirected graph associated with the marginal pmf derived from the joint pmf in (33) in Fig. 11.

Since all paths in the graph from a node in $W$ to a node in $U$ pass through a node in $V$, we have that $W - V - U$ forms a Markov chain. Therefore, $S_{d,i} - (A_i, S_{e,i}, X_i) - (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1})$ forms a Markov chain. ■

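Let us assume that a specific sequence of $(|\mathcal{M}^{(n)}|, n)$ codes exists such that the average error probabilities satisfy $P_{m,e}^{(n)} = \delta'_n \le \delta_n$ and $P_{x,e}^{(n)} = \delta'_n \le \delta_n$, and $\log |\mathcal{M}^{(n)}| = n(R - \delta'_n) \ge n(R - \delta_n)$, with $\lim_{n \to \infty} \delta_n = 0$. Then standard properties of the entropy function give

$$\begin{aligned}
n(R - \delta_n) &\le \log |\mathcal{M}^{(n)}| = H(M) \\
&= H(M) - H(X^n, M | Y^n, S_d^n) + H(M | Y^n, S_d^n) + H(X^n | M, Y^n, S_d^n) \\
&\le H(M) - H(X^n, M | Y^n, S_d^n) + H(M | Y^n, S_d^n) + H(X^n | Y^n, S_d^n).
\end{aligned}$$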

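Consider the last two terms in the above inequality. Similarly to [13], by Fano's inequality, we get

$$H(M | Y^n, S_d^n) \le h(\delta'_n) + \delta'_n \cdot \log(2^{n(R - \delta'_n)} - 1) = n\epsilon_n^{(m)},$$
$$H(X^n | Y^n, S_d^n) \le h(\delta'_n) + \delta'_n \cdot \log(|\mathcal{X}|^n - 1) = n\epsilon_n^{(x)},$$

where $h(\cdot)$ is the binary entropy function, and $\epsilon_n^{(m)} \to 0$, $\epsilon_n^{(x)} \to 0$ as $\delta_n \to 0$.

Let $n\epsilon_n^{(m)} + n\epsilon_n^{(x)} = n\epsilon_n$, where $\epsilon_n$ satisfies $\lim_{n \to \infty} \epsilon_n = 0$. Now we continue the chain of inequalities and get

$$\begin{aligned}
n(R - \delta_n) &\le H(M) - H(X^n, M | Y^n, S_d^n) + n\epsilon_n \\
&= H(M, S_e^n) - H(S_e^n | M) - H(X^n, M | Y^n, S_d^n) + n\epsilon_n \\
&\overset{(a)}{=} H(M, S_e^n, X^n) - H(S_e^n | M, A^n) - H(X^n, M | Y^n, S_d^n) + n\epsilon_n \\
&\overset{(b)}{=} H(M, S_e^n, X^n) - H(S_e^n | A^n) - H(X^n, M | Y^n, S_d^n) + n\epsilon_n \\
&= I(X^n, M; Y^n, S_d^n) - H(S_e^n | A^n) + H(S_e^n | X^n, M) + n\epsilon_n \\
&\overset{(c)}{=} I(X^n, M; Y^n, S_d^n) - H(S_e^n | A^n) + H(S_e^n | X^n, M, A^n) + n\epsilon_n \\
&= I(X^n, M; Y^n, S_d^n) - I(X^n, M; S_e^n | A^n) + n\epsilon_n,
\end{aligned}$$

where $(a)$ follows from the fact that $X^n = f^{(n)}(M, S_e^n)$ and $A^n = f_a^{(n)}(M)$, $(b)$ holds since $S_e^n$ is independent of $M$ given $A^n$, and $(c)$ follows from $A^n = f_a^{(n)}(M)$.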
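Continuing the chain of inequalities, we get

$$\begin{aligned}
n(R - \delta_n - \epsilon_n) &\le \sum_{i=1}^n I(X^n, M; Y_i, S_{d,i} | Y^{i-1}, S_d^{i-1}) - I(X^n, M; S_{e,i} | S_{e,i+1}^n, A^n) \\
&= \sum_{i=1}^n \big[I(X^n, M, S_{e,i+1}^n, A^n; Y_i, S_{d,i} | Y^{i-1}, S_d^{i-1}) - I(S_{e,i+1}^n, A^n; Y_i, S_{d,i} | X^n, M, Y^{i-1}, S_d^{i-1})\big] \\
&\qquad - \big[I(X^n, M, Y^{i-1}, S_d^{i-1}; S_{e,i} | S_{e,i+1}^n, A^n) - I(Y^{i-1}, S_d^{i-1}; S_{e,i} | X^n, M, S_{e,i+1}^n, A^n)\big] \\
&\overset{(a)}{=} \sum_{i=1}^n I(X^n, M, S_{e,i+1}^n, A^n; Y_i, S_{d,i} | Y^{i-1}, S_d^{i-1}) - I(X^n, M, Y^{i-1}, S_d^{i-1}; S_{e,i} | S_{e,i+1}^n, A^n) \\
&= \sum_{i=1}^n \big[H(Y_i, S_{d,i} | Y^{i-1}, S_d^{i-1}) - H(Y_i, S_{d,i} | Y^{i-1}, S_d^{i-1}, X^n, M, S_{e,i+1}^n, A^n)\big] \\
&\qquad - \big[H(S_{e,i} | S_{e,i+1}^n, A^n) - H(S_{e,i} | S_{e,i+1}^n, A^n, X^n, M, Y^{i-1}, S_d^{i-1})\big] \\
&\overset{(b)}{\le} \sum_{i=1}^n H(Y_i, S_{d,i}) - H(Y_i, S_{d,i} | Z_i, A_i) - H(S_{e,i} | A_i) + H(S_{e,i} | Z_i, A_i) \\
&= \sum_{i=1}^n I(A_i, Z_i; Y_i, S_{d,i}) - I(Z_i; S_{e,i} | A_i) \\
&\overset{(c)}{=} \sum_{i=1}^n I(A_i, Z_i, X_i; Y_i, S_{d,i}) - I(Z_i, X_i; S_{e,i} | A_i) \\
&= \sum_{i=1}^n I(A_i; Y_i, S_{d,i}) + I(Z_i, X_i; Y_i, S_{d,i} | A_i) - I(Z_i, X_i; S_{e,i} | A_i),
\end{aligned}$$

where $(a)$ follows from Csiszár's sum identity in [29], $\sum_{i=1}^n I(S_{e,i+1}^n, A^n; Y_i, S_{d,i} | X^n, M, Y^{i-1}, S_d^{i-1}) - I(Y^{i-1}, S_d^{i-1}; S_{e,i} | X^n, M, S_{e,i+1}^n, A^n) = 0$, and additionally using $A^n = f_a^{(n)}(M)$; $(b)$ follows from the fact that $(S_{e,i+1}^n, A^{n \setminus i}) - A_i - S_{e,i}$ forms a Markov chain and by defining $Z_i = (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1})$; and $(c)$ follows from the definition of $Z_i$.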
Consider the sum of the last two terms,

$$\begin{aligned}
\sum_{i=1}^n & I(Z_i, X_i; Y_i, S_{d,i} | A_i) - I(Z_i, X_i; S_{e,i} | A_i) \\
&\overset{(a)}{=} \sum_{i=1}^n I(Z_i, X_i; Y_i, S_{d,i} | A_i, S_{e,i}) - I(Z_i, X_i; S_{e,i} | A_i, Y_i, S_{d,i}) \\
&= \sum_{i=1}^n H(Y_i, S_{d,i} | A_i, S_{e,i}) - H(Y_i, S_{d,i} | A_i, S_{e,i}, Z_i, X_i) - I(X_i; S_{e,i} | A_i, Y_i, S_{d,i}) - I(Z_i; S_{e,i} | A_i, Y_i, S_{d,i}, X_i) \\
&\le \sum_{i=1}^n H(Y_i, S_{d,i} | A_i, S_{e,i}) - H(S_{d,i} | A_i, S_{e,i}, Z_i, X_i) - H(Y_i | A_i, S_{e,i}, Z_i, X_i, S_{d,i}) - I(X_i; S_{e,i} | A_i, Y_i, S_{d,i}) \\
&\overset{(b)}{=} \sum_{i=1}^n H(Y_i, S_{d,i} | A_i, S_{e,i}) - H(S_{d,i} | A_i, S_{e,i}, X_i) - H(Y_i | A_i, S_{e,i}, X_i, S_{d,i}) - I(X_i; S_{e,i} | A_i, Y_i, S_{d,i}) \\
&= \sum_{i=1}^n I(Y_i, S_{d,i}; X_i | A_i, S_{e,i}) - I(X_i; S_{e,i} | A_i, Y_i, S_{d,i}) \\
&\overset{(c)}{=} \sum_{i=1}^n I(X_i; Y_i, S_{d,i} | A_i) - I(X_i; S_{e,i} | A_i), \quad (34)
\end{aligned}$$

where $(a)$ follows by adding and subtracting the term $\sum_{i=1}^n I(Z_i, X_i; Y_i, S_{d,i}, S_{e,i} | A_i)$, $(b)$ follows from the memoryless property of the channel where $(Z_i, A_i) - (X_i, S_{e,i}, S_{d,i}) - Y_i$ forms a Markov chain and also from the Markov chain $S_{d,i} - (A_i, S_{e,i}, X_i) - Z_i$ (see Lemma 2), and $(c)$ follows by adding and subtracting the term $\sum_{i=1}^n I(Y_i, S_{e,i}, S_{d,i}; X_i | A_i)$. Finally, we get

$$n(R - \delta_n - \epsilon_n) \le \sum_{i=1}^n I(A_i; Y_i, S_{d,i}) + I(X_i; Y_i, S_{d,i} | A_i) - I(X_i; S_{e,i} | A_i). \quad (35)$$

Next we prove the constraint which does not involve the rate of communication. It can be considered as a restriction imposed on the set of input distributions, in a similar flavor as the dependence balance bound in [32]. From the standard properties of the entropy function, we observe that

$$\begin{aligned}
0 &\le H(M | A^n) \\
&= H(M | A^n) - H(X^n, M | Y^n, S_d^n, A^n) + H(M | Y^n, S_d^n, A^n) + H(X^n | M, Y^n, S_d^n, A^n) \\
&\le H(M | A^n) - H(X^n, M | Y^n, S_d^n, A^n) + H(M | Y^n, S_d^n) + H(X^n | Y^n, S_d^n).
\end{aligned}$$

Again, consider the last two terms in the above inequality. By Fano's inequality, we get

$$H(M | Y^n, S_d^n) \le h(\delta'_n) + \delta'_n \cdot \log(2^{n(R - \delta'_n)} - 1) = n\epsilon_n^{(m)},$$
$$H(X^n | Y^n, S_d^n) \le h(\delta'_n) + \delta'_n \cdot \log(|\mathcal{X}|^n - 1) = n\epsilon_n^{(x)}.$$

Let $n\epsilon_n^{(m)} + n\epsilon_n^{(x)} = n\epsilon_n$, where $\epsilon_n$ satisfies $\lim_{n \to \infty} \epsilon_n = 0$. Now we continue the chain of inequalities and get

$$\begin{aligned}
-n\epsilon_n &\le H(M | A^n) - H(X^n, M | Y^n, S_d^n, A^n) \\
&= H(M, S_e^n | A^n) - H(S_e^n | M, A^n) - H(X^n, M | Y^n, S_d^n, A^n) \\
&\overset{(a)}{=} H(M, S_e^n, X^n | A^n) - H(S_e^n | M, A^n) - H(X^n, M | Y^n, S_d^n, A^n) \\
&\overset{(b)}{=} H(M, S_e^n, X^n | A^n) - H(S_e^n | A^n) - H(X^n, M | Y^n, S_d^n, A^n) \\
&= I(X^n, M; Y^n, S_d^n | A^n) - I(X^n, M; S_e^n | A^n)
\end{aligned}$$

71 

= ^/(X",M;y,,5d,,|y'-\5r\^")-/(^",M;5e,i|5,%i,^") 

(c) ^ 

< ^H(yi,Sd,i|y*-^5r^^o-^(yi,^c^,d>''-^5^^^i'^^^)-A^^^;^e,il^eV^ 

i=l 

n 

= ^ J(X", M; Yi, Sa,i\Y'-\S\-\Ai) - I{X", M; Se,i\S:,i+„A^) 

i=l 

n 

M, y,, 5d,,|y*-\ \ Ai) - y,^ 

- y , M, Y'-\S'-') 

where $(a)$ follows from the fact that $X^n = f^{(n)}(M, S_e^n)$, $(b)$ holds since $S_e^n$ is independent of $M$ given $A^n$, $(c)$ holds since $A^n = f_a^{(n)}(M)$, and $(d)$ follows from the fact that the last term is zero since $A^n = f_a^{(n)}(M)$.
Continuing the chain of inequalities, we have

n 
i=l 

- [/(X", M, y'-\ 5^'; S,,i\Sl,+^,A^) - I{Y'-\ Sl-^;S,,i\X^, M, A")] 

n 

i=l 

n 

= Y}H{Yu Sd,i\Y'-\S\-\Ai) - H{Yi, Sa,i\Y'-\ S'^-\ X^ , M, S^.+^A^)] 

i=l 

- [if(5e,i|5r,,+i, A") - if (5e,,|5,",,+i, A", X", M, y'-i, ')] 

< J2H{Y^,Sd,^\A^) - H{Y,,Sd,i\Zi,A,) - H{Se,^\A^) + H{Se,^\Z^,Ai) 
i=l 

( ) ^ 

= ^(-^i) -'^i; Yi, Sd,i\Ai) — I{Zi,Xi; Se,i\Ai) 

$$\overset{(d)}{\le} \sum_{i=1}^n I(X_i; Y_i, S_{d,i} | A_i) - I(X_i; S_{e,i} | A_i), \quad (36)$$

where $(a)$ follows by Csiszár's sum identity, $\sum_{i=1}^n I(S_{e,i+1}^n, A^n; Y_i, S_{d,i} | X^n, M, Y^{i-1}, S_d^{i-1}) - I(Y^{i-1}, S_d^{i-1}; S_{e,i} | X^n, M, S_{e,i+1}^n, A^n) = 0$, and $A^n = f_a^{(n)}(M)$; $(b)$ holds by using the Markov chain $(S_{e,i+1}^n, A^{n \setminus i}) - A_i - S_{e,i}$ and by defining $Z_i = (M, A^{n \setminus i}, S_{e,i+1}^n, X^n, Y^{i-1}, S_d^{i-1})$; $(c)$ follows from the definition of $Z_i$; and $(d)$ follows from the same steps as in obtaining (34).

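Let $Q$ be a random variable uniformly distributed over $\{1, \ldots, n\}$ and independent of $(M, A^n, X^n, S_e^n, S_d^n, Y^n)$. We can then rewrite (35) and (36) as

$$\begin{aligned}
R &\le \frac{1}{n} \sum_{i=1}^n I(A_i, X_i; Y_i, S_{d,i} | Q = i) - I(X_i; S_{e,i} | A_i, Q = i) + \delta_n + \epsilon_n \\
&= I(A_Q, X_Q; Y_Q, S_{d,Q} | Q) - I(X_Q; S_{e,Q} | A_Q, Q) + \delta_n + \epsilon_n
\end{aligned}$$

and

$$\begin{aligned}
0 &\le \frac{1}{n} \sum_{i=1}^n I(X_i; Y_i, S_{d,i} | A_i, Q = i) - I(X_i; S_{e,i} | A_i, Q = i) + \epsilon_n \\
&= I(X_Q; Y_Q, S_{d,Q} | A_Q, Q) - I(X_Q; S_{e,Q} | A_Q, Q) + \epsilon_n.
\end{aligned}$$

Now since we have that $P_{S_{e,Q},S_{d,Q}|A_Q} = P_{S_e,S_d|A}$, $P_{Y_Q|X_Q,S_{e,Q},S_{d,Q}} = P_{Y|X,S_e,S_d}$, and $A_Q - (X_Q, S_{e,Q}, S_{d,Q}) - Y_Q$ forms a Markov chain, we identify $A = A_Q$, $S_e = S_{e,Q}$, $S_d = S_{d,Q}$, $X = X_Q$, and $Y = Y_Q$ to finally obtain

$$R \le I(A, X; Y, S_d | Q) - I(X; S_e | A, Q) + \delta_n + \epsilon_n$$
$$\text{and} \quad 0 \le I(X; Y, S_d | A, Q) - I(X; S_e | A, Q) + \epsilon_n,$$

for some joint distribution

$$P_Q(q) P_{A|Q}(a|q) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e,Q}(x | a, s_e, q) P_{Y|X,S_e,S_d}(y | x, s_e, s_d). \quad (37)$$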

Lemma 3: From the joint distribution in (37), $(Y, S_d) - (X, A, S_e) - Q$ and $S_e - A - Q$ form Markov chains.

Proof: We use a (partial) list of properties satisfied by the Markov chain (the conditional independence relation) in [31]. As a quick reference, we restate it in the following. Let $W, X, Y, Z$ be random variables, and let "$\Rightarrow$" denote "implies",

weak union: $X - Z - (W, Y) \Rightarrow X - (Z, W) - Y$

contraction: $X - Z - Y$ and $X - (Z, Y) - W \Rightarrow X - Z - (Y, W)$.

From (37), the following Markov chains are readily derived.

$$Q - A - (S_e, S_d) \quad (38)$$
$$X - (A, S_e, Q) - S_d \quad (39)$$
$$(A, Q) - (X, S_e, S_d) - Y \quad (40)$$

The chain $S_e - A - Q$ follows directly from (38). By the weak union property, we can derive from (38) the Markov chain $Q - (A, S_e) - S_d$. Then combining it with (39), by using the contraction property, we get the Markov chain

$$(X, Q) - (A, S_e) - S_d. \quad (41)$$

Again using the weak union in (40) and (41), we get

$$Q - (X, A, S_e, S_d) - Y$$
$$\text{and} \quad Q - (X, A, S_e) - S_d.$$

Combining these two Markov chains using the contraction property, we finally get the Markov chain $Q - (X, A, S_e) - (Y, S_d)$. ■
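The two Markov chains of Lemma 3 can also be checked numerically for any particular distribution of the form (37) by verifying that the corresponding conditional mutual information is zero. The following minimal sketch does this for randomly drawn binary conditional pmfs; the alphabet sizes, the random construction, and all function names are assumptions made here for illustration only.

```python
import itertools
import math
import random


def random_pmf(rng, size):
    """Draw a strictly positive pmf on `size` points."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def build_joint(seed=0):
    """Joint pmf of (Q, A, S_e, S_d, X, Y) factorized as in (37), all binary."""
    rng = random.Random(seed)
    bits = (0, 1)
    p_q = random_pmf(rng, 2)
    p_a = {q: random_pmf(rng, 2) for q in bits}                # P_{A|Q}
    p_s = {a: random_pmf(rng, 4) for a in bits}                # P_{Se,Sd|A}
    triples = list(itertools.product(bits, repeat=3))
    p_x = {k: random_pmf(rng, 2) for k in triples}             # key (a, se, q)
    p_y = {k: random_pmf(rng, 2) for k in triples}             # key (x, se, sd)
    joint = {}
    for q, a, se, sd, x, y in itertools.product(bits, repeat=6):
        joint[(q, a, se, sd, x, y)] = (
            p_q[q] * p_a[q][a] * p_s[a][2 * se + sd]
            * p_x[(a, se, q)][x] * p_y[(x, se, sd)][y]
        )
    return joint


def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_mutual_info(joint, xs, ys, zs):
    """I(X; Y | Z) in bits; xs, ys, zs are tuples of coordinate indices."""
    p_xyz = marginal(joint, xs + ys + zs)
    p_xz = marginal(joint, xs + zs)
    p_yz = marginal(joint, ys + zs)
    p_z = marginal(joint, zs)
    nx, ny = len(xs), len(ys)
    total = 0.0
    for xyz, p in p_xyz.items():
        if p > 0.0:
            x, y, z = xyz[:nx], xyz[nx:nx + ny], xyz[nx + ny:]
            total += p * math.log2(p * p_z[z] / (p_xz[x + z] * p_yz[y + z]))
    return total


# Coordinate order: Q=0, A=1, S_e=2, S_d=3, X=4, Y=5.
joint = build_joint(seed=0)
chain_1 = cond_mutual_info(joint, (0,), (5, 3), (4, 1, 2))  # I(Q; Y, S_d | X, A, S_e)
chain_2 = cond_mutual_info(joint, (0,), (2,), (1,))         # I(Q; S_e | A)
```

Both `chain_1` and `chain_2` should vanish up to floating-point error, in agreement with $(Y, S_d) - (X, A, S_e) - Q$ and $S_e - A - Q$.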
To this end, we note that under any distribution of the form above, we have

$$\begin{aligned}
I(A, X; Y, S_d | Q) - I(X; S_e | A, Q) &= I(A, X, S_e; Y, S_d | Q) - I(X, Y, S_d; S_e | A, Q) \\
&= H(Y, S_d | Q) - H(Y, S_d | A, X, S_e, Q) - H(S_e | A, Q) + H(S_e | X, Y, S_d, A, Q) \\
&\le H(Y, S_d) - H(Y, S_d | A, X, S_e, Q) - H(S_e | A, Q) + H(S_e | X, Y, S_d, A) \\
&\overset{(*)}{=} H(Y, S_d) - H(Y, S_d | A, X, S_e) - H(S_e | A) + H(S_e | X, Y, S_d, A) \\
&= I(A, X, S_e; Y, S_d) - I(X, Y, S_d; S_e | A) \\
&= I(A, X; Y, S_d) - I(X; S_e | A),
\end{aligned}$$

and

$$\begin{aligned}
I(X; Y, S_d | A, Q) - I(X; S_e | A, Q) &= I(X, S_e; Y, S_d | A, Q) - I(X, Y, S_d; S_e | A, Q) \\
&= H(Y, S_d | A, Q) - H(Y, S_d | A, X, S_e, Q) - H(S_e | A, Q) + H(S_e | X, Y, S_d, A, Q) \\
&\le H(Y, S_d | A) - H(Y, S_d | A, X, S_e, Q) - H(S_e | A, Q) + H(S_e | X, Y, S_d, A) \\
&\overset{(*)}{=} H(Y, S_d | A) - H(Y, S_d | A, X, S_e) - H(S_e | A) + H(S_e | X, Y, S_d, A) \\
&= I(X, S_e; Y, S_d | A) - I(X, Y, S_d; S_e | A) \\
&= I(X; Y, S_d | A) - I(X; S_e | A),
\end{aligned}$$

where both equalities $(*)$ follow from the Markov chains $(Y, S_d) - (X, A, S_e) - Q$ and $S_e - A - Q$ (derived from (37), see Lemma 3), and the joint distribution of $(A, S_e, S_d, X, Y)$ is of the form

$$\sum_{q} P_Q(q) P_{A|Q}(a|q) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e,Q}(x | a, s_e, q) P_{Y|X,S_e,S_d}(y | x, s_e, s_d) = P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d).$$

The proof is concluded by taking the limit $n \to \infty$. ■

Appendix E 

Proof of convexity of the region $\mathcal{R}_{mod}$ with dummy variable $\tilde{R}$

Consider the achievable rate $0 \le R \le I(A, X; Y, S_d) - I(X; S_e | A)$ for some $P_A(a)$, $P_{X|A,S_e}(x | a, s_e)$ such that $0 \le I(X; Y, S_d | A) - I(X; S_e | A)$. We modify it by introducing a dummy variable $\tilde{R}$ which can take either a positive or a negative value, and we obtain the modified "region". The modified region $\mathcal{R}_{mod}$ is the set

$$\begin{aligned}
\mathcal{R}_{mod} = \{(R, \tilde{R}) : \; & 0 \le R \le I(A, X; Y, S_d) - I(X; S_e | A) \\
& \tilde{R} \le I(X; Y, S_d | A) - I(X; S_e | A) \\
& \text{for some } P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)\}.
\end{aligned}$$

We will show that the region above is convex. Assume that two arbitrary points $(R^{(1)}, \tilde{R}^{(1)})$ and $(R^{(2)}, \tilde{R}^{(2)})$ are in $\mathcal{R}_{mod}$. This implies that there exist distributions

$$P^{(1)}_{A,S_e,S_d,X,Y}(a, s_e, s_d, x, y) = P_A^{(1)}(a) P_{S_e,S_d|A}(s_e, s_d | a) P^{(1)}_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$$
$$\text{and} \quad P^{(2)}_{A,S_e,S_d,X,Y}(a, s_e, s_d, x, y) = P_A^{(2)}(a) P_{S_e,S_d|A}(s_e, s_d | a) P^{(2)}_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$$

such that

$$\begin{aligned}
0 \le R^{(1)} &\le I^{(1)}(A, X; Y, S_d) - I^{(1)}(X; S_e | A) \\
\tilde{R}^{(1)} &\le I^{(1)}(X; Y, S_d | A) - I^{(1)}(X; S_e | A) \\
\text{and} \quad 0 \le R^{(2)} &\le I^{(2)}(A, X; Y, S_d) - I^{(2)}(X; S_e | A) \\
\tilde{R}^{(2)} &\le I^{(2)}(X; Y, S_d | A) - I^{(2)}(X; S_e | A), \quad (42)
\end{aligned}$$

where $I^{(i)}(\cdot)$ denotes the mutual information associated with $P^{(i)}_{A,S_e,S_d,X,Y}(a, s_e, s_d, x, y)$, $i = 1, 2$.

Now let $Q$ be an independent random variable taking values in $\{1, 2\}$, where $\Pr(Q = 1) = 1 - \Pr(Q = 2) = \alpha$, $0 < \alpha < 1$. Then we have the joint distribution

$$P_{Q,A,S_e,S_d,X,Y}(q, a, s_e, s_d, x, y) = P_Q(q) P_{A|Q}(a|q) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e,Q}(x | a, s_e, q) P_{Y|X,S_e,S_d}(y | x, s_e, s_d), \quad (43)$$

where $P_{A|Q}(a|q) = P_A^{(q)}(a)$ and $P_{X|A,S_e,Q}(x | a, s_e, q) = P^{(q)}_{X|A,S_e}(x | a, s_e)$ for $q = 1, 2$.
Consider now the marginal distribution (averaged over $Q$)

$$P_{A,S_e,S_d,X,Y}(a, s_e, s_d, x, y) = \sum_{q=1,2} P_Q(q) P_{A|Q}(a|q) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e,Q}(x | a, s_e, q) P_{Y|X,S_e,S_d}(y | x, s_e, s_d),$$

which is associated with the mutual information terms $I(A, X; Y, S_d) - I(X; S_e | A)$ and $I(X; Y, S_d | A) - I(X; S_e | A)$. It follows that

$$\begin{aligned}
I(A, X; Y, S_d) - I(X; S_e | A) &= I(A, X, S_e; Y, S_d) - I(X, Y, S_d; S_e | A) \\
&\ge I(A, X, S_e; Y, S_d | Q) - I(X, Y, S_d; S_e | A, Q) \\
&= \alpha \big[I^{(1)}(A, X, S_e; Y, S_d) - I^{(1)}(X, Y, S_d; S_e | A)\big] + (1 - \alpha) \big[I^{(2)}(A, X, S_e; Y, S_d) - I^{(2)}(X, Y, S_d; S_e | A)\big] \\
&= \alpha \big[I^{(1)}(A, X; Y, S_d) - I^{(1)}(X; S_e | A)\big] + (1 - \alpha) \big[I^{(2)}(A, X; Y, S_d) - I^{(2)}(X; S_e | A)\big], \quad (44)
\end{aligned}$$

and

$$\begin{aligned}
I(X; Y, S_d | A) - I(X; S_e | A) &= I(X, S_e; Y, S_d | A) - I(X, Y, S_d; S_e | A) \\
&\ge I(X, S_e; Y, S_d | A, Q) - I(X, Y, S_d; S_e | A, Q) \\
&= \alpha \big[I^{(1)}(X, S_e; Y, S_d | A) - I^{(1)}(X, Y, S_d; S_e | A)\big] + (1 - \alpha) \big[I^{(2)}(X, S_e; Y, S_d | A) - I^{(2)}(X, Y, S_d; S_e | A)\big] \\
&= \alpha \big[I^{(1)}(X; Y, S_d | A) - I^{(1)}(X; S_e | A)\big] + (1 - \alpha) \big[I^{(2)}(X; Y, S_d | A) - I^{(2)}(X; S_e | A)\big], \quad (45)
\end{aligned}$$

where both inequalities follow from $(Y, S_d) - (X, A, S_e) - Q$ and $S_e - A - Q$ obtained in Lemma 3.

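From (42), (44), and (45), it follows that there exists a distribution $P_{Q,A,S_e,S_d,X,Y}(q, a, s_e, s_d, x, y)$ as in (43) with marginal factorized as $P_A(a) P_{S_e,S_d|A}(s_e, s_d | a) P_{X|A,S_e}(x | a, s_e) P_{Y|X,S_e,S_d}(y | x, s_e, s_d)$ such that

$$I(A, X; Y, S_d) - I(X; S_e | A) \ge \alpha R^{(1)} + (1 - \alpha) R^{(2)} \ge 0$$
$$I(X; Y, S_d | A) - I(X; S_e | A) \ge \alpha \tilde{R}^{(1)} + (1 - \alpha) \tilde{R}^{(2)}. \quad (46)$$

By the definition of $\mathcal{R}_{mod}$ and (46), we have that

$$\big(\alpha R^{(1)} + (1 - \alpha) R^{(2)}, \; \alpha \tilde{R}^{(1)} + (1 - \alpha) \tilde{R}^{(2)}\big) \in \mathcal{R}_{mod}.$$

This implies that any convex combination of points $(R, \tilde{R}) \in \mathcal{R}_{mod}$ is also in the set $\mathcal{R}_{mod}$, and thus $\mathcal{R}_{mod}$ is convex. ■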

Appendix F 
Proof of Proposition 4 

A. Achievability Proof of Proposition 4 

Similarly to the previous achievability proof, the proof follows from a standard random coding argument where
we use the definition and properties of $\epsilon$-typicality as in [28]. We use the technique of rate splitting, i.e., the message
$M$ of rate $R$ is split into two messages $M_1$ and $M_2$ of rates $R_1$ and $R_2$. Two-stage coding is then considered,
i.e., a first stage for communicating the identity of the action sequence, and a second stage for communicating the
identity of $S_e^n$ based on the known action sequence.

For given channels with transition probabilities $P_{S_e,S_d|A}(s_e,s_d|a)$ and $P_{Y|X,S_e,S_d}(y|x,s_e,s_d)$ we can assign the
joint probability to any random vector $(A, X, S_e)$ by

\[
P_{A,S_e,S_d,X,Y}(a,s_e,s_d,x,y) = P_A(a)\,P_{S_e,S_d|A}(s_e,s_d|a)\,P_{X|A,S_e}(x|a,s_e)\,P_{Y|X,S_e,S_d}(y|x,s_e,s_d).
\]

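Codebook Generation: Fix $P_A$ and $P_{X|A,S_e}$. Let $\mathcal{M}_1^{(n)} = \{1, 2, \ldots, |\mathcal{M}_1^{(n)}|\}$, $\mathcal{M}_2^{(n)} = \{1, 2, \ldots, |\mathcal{M}_2^{(n)}|\}$, and
$\mathcal{J}^{(n)} = \{1, 2, \ldots, |\mathcal{J}^{(n)}|\}$. For all $m_1 \in \mathcal{M}_1^{(n)}$, generate $a^n(m_1)$ i.i.d. according to $\prod_{i=1}^n P_A(a_i)$. Then for each $m_1$,
generate $|\mathcal{M}_2^{(n)}||\mathcal{J}^{(n)}|$ codewords $\{s_e^n(m_1, m_2, j)\}_{m_2 \in \mathcal{M}_2^{(n)},\, j \in \mathcal{J}^{(n)}}$ i.i.d. each according to $\prod_{i=1}^n P_{S_e|A}(s_{e,i}|a_i(m_1))$.
Finally, for each $(a^n, s_e^n)$ pair, generate $x^n$ i.i.d. according to $\prod_{i=1}^n P_{X|S_e,A}(x_i|s_{e,i}, a_i(m_1))$. Then the codebooks
are revealed to the action encoder, the channel encoder, and the decoder. Let $0 < \epsilon_0 < \epsilon_1 < \epsilon < 1$.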
Encoding: Given the message $m = (m_1, m_2) \in \mathcal{M}_1^{(n)} \times \mathcal{M}_2^{(n)}$, the action codeword $a^n(m_1)$ is chosen and the channel
state information $(s_e^n, s_d^n)$ is generated as an output of the memoryless channel, $P_{S_e^n,S_d^n|A^n}(s_e^n, s_d^n|a^n) =
\prod_{i=1}^n P_{S_e,S_d|A}(s_{e,i}, s_{d,i}|a_i)$. The encoder looks for the smallest value of $j \in \mathcal{J}^{(n)}$ such that $s_e^n(m_1, m_2, j) = s_e^n$.
The channel input sequence is then chosen to be $x^n(m_1, m_2, j)$. If no such $j$ exists, set $j = 1$.

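Decoding: Upon receiving $y^n$ and $s_d^n$, the decoder in the first step looks for the smallest $\tilde{m}_1 \in \mathcal{M}_1^{(n)}$ such that
$(y^n, s_d^n, a^n(\tilde{m}_1)) \in T_\epsilon^{(n)}(Y, S_d, A)$. If successful, then set $\hat{m}_1 = \tilde{m}_1$. Otherwise, set $\hat{m}_1 = 1$. Then, based on
the known $a^n(\hat{m}_1)$, the decoder looks for a pair $(\tilde{m}_2, \tilde{j})$ with the smallest $\tilde{m}_2 \in \mathcal{M}_2^{(n)}$ and $\tilde{j} \in \mathcal{J}^{(n)}$ such that
$(y^n, s_d^n, a^n(\hat{m}_1), s_e^n(\hat{m}_1, \tilde{m}_2, \tilde{j}), x^n(\hat{m}_1, \tilde{m}_2, \tilde{j})) \in T_\epsilon^{(n)}(Y, S_d, A, S_e, X)$. If there exists such a pair, the decoded
message is set to be $\hat{m} = (\hat{m}_1, \tilde{m}_2)$, and the decoded state $\hat{s}_e^n = s_e^n(\hat{m}_1, \tilde{m}_2, \tilde{j})$. Otherwise, $\hat{m} = (1, 1)$ and
$\hat{s}_e^n = s_e^n(1, 1, 1)$.*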
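The codebook generation and encoding steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the alphabets, the distributions `p_a`, `p_se_a`, `p_x_ase`, the tiny block length, and the function names are all hypothetical, and indices are 0-based, so the paper's fallback $j = 1$ becomes `j = 0`. The typicality decoder is not modelled.

```python
import numpy as np

def sample_rows(rng, cond_pmf, idx):
    """Draw one symbol per entry of idx from the pmf row cond_pmf[idx] (inverse-CDF sampling)."""
    cdf = np.cumsum(cond_pmf, axis=-1)
    u = rng.random(idx.shape)
    sym = (u[..., None] > cdf[idx]).sum(axis=-1)
    return np.minimum(sym, cond_pmf.shape[-1] - 1)  # guard against rounding in the CDF

def generate_codebooks(rng, n, n_m1, n_m2, n_j, p_a, p_se_a, p_x_ase):
    # a^n(m1): i.i.d. according to P_A
    a = rng.choice(len(p_a), size=(n_m1, n), p=p_a)
    # s_e^n(m1, m2, j): i.i.d. according to P_{Se|A}(. | a_i(m1))
    a_rep = np.broadcast_to(a[:, None, None, :], (n_m1, n_m2, n_j, n))
    se = sample_rows(rng, p_se_a, a_rep)
    # x^n(m1, m2, j): i.i.d. according to P_{X|Se,A}(. | s_{e,i}, a_i(m1))
    flat = p_x_ase.reshape(-1, p_x_ase.shape[-1])
    x = sample_rows(rng, flat, a_rep * p_x_ase.shape[1] + se)
    return a, se, x

def encode(se_book, x_book, m1, m2, s_e):
    """Return (j, x^n) for the smallest j with s_e^n(m1, m2, j) == s_e; fall back to j = 0."""
    matches = np.flatnonzero((se_book[m1, m2] == s_e).all(axis=-1))
    j = int(matches[0]) if matches.size else 0
    return j, x_book[m1, m2, j]

rng = np.random.default_rng(1)
p_a = np.array([0.5, 0.5])
p_se_a = np.array([[0.9, 0.1], [0.2, 0.8]])      # indexed [a, s_e]
p_x_ase = np.array([[[0.7, 0.3], [0.4, 0.6]],
                    [[0.5, 0.5], [0.1, 0.9]]])   # indexed [a, s_e, x]
a_book, se_book, x_book = generate_codebooks(
    rng, n=6, n_m1=2, n_m2=2, n_j=8, p_a=p_a, p_se_a=p_se_a, p_x_ase=p_x_ase)

# Pretend the channel produced a state sequence that is present in the codebook.
j, x = encode(se_book, x_book, m1=1, m2=0, s_e=se_book[1, 0, 3])
print("chosen j:", j, "channel input:", x)
```

In the proof the state sequence is produced by the channel rather than read from the codebook, which is why the index set $\mathcal{J}^{(n)}$ must be large enough for a match to exist with high probability.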

Analysis of Probability of Error: Due to the symmetry of the random code construction, the error probability
does not depend on which message was sent. Assume that $M = (M_1, M_2)$ was sent and that $J$ was chosen at the
encoder. We define the error events as follows.

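\[
\begin{aligned}
\mathcal{E}_1 &= \{A^n(M_1) \notin T_{\epsilon_0}^{(n)}(A)\} \\
\mathcal{E}_2 &= \{(S_e^n, S_d^n, A^n(M_1)) \notin T_{\epsilon_1}^{(n)}(S_e, S_d, A)\} \\
\mathcal{E}_{3a} &= \{S_e^n \neq S_e^n(M_1, M_2, j) \text{ for all } j \in \mathcal{J}^{(n)}\} \\
\mathcal{E}_{3b} &= \{(S_e^n, A^n(M_1), X^n(M_1, M_2, J)) \notin T_\epsilon^{(n)}(S_e, A, X)\} \\
\mathcal{E}_{4a} &= \{(Y^n, S_d^n, A^n(M_1)) \notin T_\epsilon^{(n)}(Y, S_d, A)\} \\
\mathcal{E}_{4b} &= \{(Y^n, S_d^n, A^n(\tilde{m}_1)) \in T_\epsilon^{(n)}(Y, S_d, A) \text{ for some } \tilde{m}_1 \in \mathcal{M}_1^{(n)},\ \tilde{m}_1 \neq M_1\} \\
\mathcal{E}_{5a} &= \{(Y^n, S_d^n, A^n(M_1), S_e^n(M_1, M_2, J), X^n(M_1, M_2, J)) \notin T_\epsilon^{(n)}(Y, S_d, A, S_e, X)\} \\
\mathcal{E}_{5b} &= \{(Y^n, S_d^n, A^n(M_1), S_e^n(M_1, \tilde{m}_2, \tilde{j}), X^n(M_1, \tilde{m}_2, \tilde{j})) \in T_\epsilon^{(n)}(Y, S_d, A, S_e, X) \\
&\qquad \text{for some } (\tilde{m}_2, \tilde{j}) \in \mathcal{M}_2^{(n)} \times \mathcal{J}^{(n)},\ (\tilde{m}_2, \tilde{j}) \neq (M_2, J)\}.
\end{aligned}
\]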

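The probability of error events can be bounded by

\[
\begin{aligned}
\Pr(\mathcal{E}) \leq{}& \Pr(\mathcal{E}_1) + \Pr(\mathcal{E}_2 \cap \mathcal{E}_1^c) + \Pr(\mathcal{E}_{3a} \cap \mathcal{E}_2^c) + \Pr(\mathcal{E}_{3b} \cap \mathcal{E}_{3a}^c \cap \mathcal{E}_2^c) + \Pr(\mathcal{E}_{4a} \cap \mathcal{E}_3^c) + \Pr(\mathcal{E}_{4b}) \\
&+ \Pr(\mathcal{E}_{5a} \cap \mathcal{E}_3^c) + \Pr(\mathcal{E}_{5b}),
\end{aligned}
\]

where $\mathcal{E}_3 = \mathcal{E}_{3a} \cup \mathcal{E}_{3b}$ and $\mathcal{E}_i^c$ denotes the complement of event $\mathcal{E}_i$.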

1) Since $A^n(M_1)$ is i.i.d. according to $P_A$, by the LLN we have $\Pr(\mathcal{E}_1) \to 0$ as $n \to \infty$.

2) Consider the event $\mathcal{E}_1^c$ where we have $A^n(M_1) \in T_{\epsilon_0}^{(n)}(A)$. Since $(S_e^n, S_d^n)$ is distributed according to
$\prod_{i=1}^n P_{S_e,S_d|A}(s_{e,i}, s_{d,i}|a_i)$, by the conditional typicality lemma [28], we have that $\Pr(\mathcal{E}_2 \cap \mathcal{E}_1^c) \to 0$ as $n \to \infty$.



*We note that although the simultaneous joint typicality decoding gives us different constraints on the individual rate as compared to the 
sequential two-stage decoding considered in this paper, it gives the same constraints on the total transmission rate in which we are interested. 
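3a) Consider the event $\mathcal{E}_2^c$ where we have $(S_e^n, S_d^n, A^n(M_1)) \in T_{\epsilon_1}^{(n)}(S_e, S_d, A)$. It follows from the property of
typical sequences [28] that $P_{S_e^n|A^n}(s_e^n|a^n) \geq 2^{-n[H(S_e|A) + \delta_{\epsilon_1}]}$. Since both $S_e^n$ and $S_e^n(M_1, M_2, j)$ are i.i.d. according to $P_{S_e|A}$,
we have $\Pr(\mathcal{E}_{3a} \cap \mathcal{E}_2^c) \to 0$ as $n \to \infty$ if $\frac{1}{n}\log|\mathcal{J}^{(n)}| > H(S_e|A) + \delta_{\epsilon_1}$, where $\delta_{\epsilon_1} \to 0$ as $\epsilon_1 \to 0$.

3b) Consider the event $\mathcal{E}_{3a}^c$ where $J$ is selected and $S_e^n = S_e^n(M_1, M_2, J)$. Since $X^n$ is i.i.d. according to
$\prod_{i=1}^n P_{X|S_e,A}(x_i|s_{e,i}, a_i)$, by the conditional typicality lemma, we have that $\Pr(\mathcal{E}_{3b} \cap \mathcal{E}_{3a}^c \cap \mathcal{E}_2^c) \to 0$ as $n \to \infty$.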

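4a) Consider the event $\mathcal{E}_3^c$ where we have $(S_e^n, A^n(M_1), X^n(M_1, M_2, J)) \in T_\epsilon^{(n)}(S_e, A, X)$. Since we have
$S_d - (A, S_e) - X$ forms a Markov chain and $S_d^n$ is distributed according to $\prod_{i=1}^n P_{S_d|A,S_e}(s_{d,i}|a_i, s_{e,i})$, by the
conditional typicality lemma, we have that $\Pr((S_d^n, S_e^n, A^n, X^n) \in T_\epsilon^{(n)}(S_d, S_e, A, X)) \to 1$ as $n \to \infty$. And since
we have the Markov chain $A - (X, S_e, S_d) - Y$ and $Y^n$ is distributed according to $\prod_{i=1}^n P_{Y|X,S_e,S_d}(y_i|x_i, s_{e,i}, s_{d,i})$,
by using once again the conditional typicality lemma, it follows that $\Pr((Y^n, S_d^n, S_e^n, A^n(M_1), X^n(M_1, M_2, J)) \in
T_\epsilon^{(n)}(Y, A, S_e, S_d, X)) \to 1$ as $n \to \infty$. This also implies that $\Pr(\mathcal{E}_{4a} \cap \mathcal{E}_3^c) \to 0$ as $n \to \infty$.

4b) By the packing lemma [28], we have $\Pr(\mathcal{E}_{4b}) \to 0$ as $n \to \infty$ if $\frac{1}{n}\log|\mathcal{M}_1^{(n)}| < I(A; Y, S_d) - \delta_\epsilon$, where
$\delta_\epsilon \to 0$ as $\epsilon \to 0$.

5a) As in 4a), we have $\Pr(\mathcal{E}_{5a} \cap \mathcal{E}_3^c) \to 0$ as $n \to \infty$.

5b) Averaging over all $J = j$, by the packing lemma where $S_e^n$ is i.i.d. according to $\prod_{i=1}^n P_{S_e|A}(s_{e,i}|a_i)$ and $X^n$
is i.i.d. according to $\prod_{i=1}^n P_{X|S_e,A}(x_i|s_{e,i}, a_i)$, we have $\Pr(\mathcal{E}_{5b}) \to 0$ as $n \to \infty$ if $\frac{1}{n}\log|\mathcal{M}_2^{(n)}| + \frac{1}{n}\log|\mathcal{J}^{(n)}| <
I(S_e, X; Y, S_d|A) - \delta_\epsilon$.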

Finally, by combining the bounds on the code rates,

\[
\begin{aligned}
\frac{1}{n}\log|\mathcal{J}^{(n)}| &> H(S_e|A) + \delta_{\epsilon_1}, \\
\frac{1}{n}\log|\mathcal{M}_1^{(n)}| &< I(A; Y, S_d) - \delta_\epsilon, \\
\frac{1}{n}\log|\mathcal{M}_2^{(n)}| + \frac{1}{n}\log|\mathcal{J}^{(n)}| &< I(S_e, X; Y, S_d|A) - \delta_\epsilon,
\end{aligned}
\]

where $\epsilon > 0$ can be made arbitrarily small with increasing block length $n$, we have shown that, for any $\delta > 0$, with
$n$ sufficiently large, $\Pr(\mathcal{E}) < \delta$ when $R < I(A; Y, S_d) + I(S_e, X; Y, S_d|A) - H(S_e|A)$ and $I(S_e, X; Y, S_d|A) -
H(S_e|A) > 0$ for some $P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$. Again, we note that the
latter condition is for the successful two-stage coding, i.e., we can split the message into two parts with positive
rates. This together with a random coding argument concludes the achievability proof. ■
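The way the three codebook-size bounds combine into the two stated conditions can be written out explicitly. In the sketch below, $R_1$, $R_2$, and $R_J$ are shorthand we introduce for $\frac{1}{n}\log|\mathcal{M}_1^{(n)}|$, $\frac{1}{n}\log|\mathcal{M}_2^{(n)}|$, and $\frac{1}{n}\log|\mathcal{J}^{(n)}|$; the symbol $R_J$ does not appear in the paper.

```latex
\begin{align*}
R_1 &< I(A;Y,S_d) - \delta_\epsilon, \\
R_2 &< I(S_e,X;Y,S_d|A) - \delta_\epsilon - R_J
      < I(S_e,X;Y,S_d|A) - H(S_e|A) - \delta_\epsilon - \delta_{\epsilon_1}, \\
R = R_1 + R_2 &< I(A;Y,S_d) + I(S_e,X;Y,S_d|A) - H(S_e|A) - 2\delta_\epsilon - \delta_{\epsilon_1} \\
  &= I(A,S_e,X;Y,S_d) - H(S_e|A) - 2\delta_\epsilon - \delta_{\epsilon_1}.
\end{align*}
```

The last equality is the chain rule for mutual information, which is why the achievable rate matches the expression $I(A, X, S_e; Y, S_d) - H(S_e|A)$ used in the converse. A nonnegative $R_2$ satisfying the second line exists only if $I(S_e,X;Y,S_d|A) - H(S_e|A) > 0$ once the $\delta$ terms vanish, which is the two-stage coding condition.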



B. Converse Proof of Proposition 4 

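We show that, for any achievable rate $R$, it follows that $R \leq I(A, X, S_e; Y, S_d) - H(S_e|A)$ and $0 \leq I(S_e, X; Y, S_d|A) -
H(S_e|A)$ for some $P_A(a)P_{S_e,S_d|A}(s_e, s_d|a)P_{X|A,S_e}(x|a, s_e)P_{Y|X,S_e,S_d}(y|x, s_e, s_d)$. From the problem formulation,
we can write the joint probability mass function,

\[
\begin{aligned}
&P_{M,A^n,S_e^n,S_d^n,X^n,Y^n,\hat{M},\hat{S}_e^n}(m, a^n, s_e^n, s_d^n, x^n, y^n, \hat{m}, \hat{s}_e^n) \\
&\quad = \frac{1}{|\mathcal{M}^{(n)}|}\, 1_{\{f_a^{(n)}(m) = a^n,\ f^{(n)}(m, s_e^n) = x^n,\ g_{s_e}^{(n)}(y^n, s_d^n) = \hat{s}_e^n,\ g_m^{(n)}(y^n, s_d^n) = \hat{m}\}} \prod_{i=1}^n P_{S_e,S_d|A}(s_{e,i}, s_{d,i}|a_i)\,P_{Y|X,S_e,S_d}(y_i|x_i, s_{e,i}, s_{d,i}),
\end{aligned}
\]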
where $M$ is chosen uniformly at random from the set $\mathcal{M}^{(n)} = \{1, 2, \ldots, |\mathcal{M}^{(n)}|\}$.

Let us assume that a specific sequence of $(|\mathcal{M}^{(n)}|, n)$ codes exists such that the average error probabilities
$P_{m,e}^{(n)} = \delta_n' \leq \delta_n$, $P_{s_e,e}^{(n)} = \delta_n' \leq \delta_n$, and $\log|\mathcal{M}^{(n)}| = n(R - \delta_n') \geq n(R - \delta_n)$, with $\lim_{n \to \infty} \delta_n = 0$. Then standard
properties of the entropy function give

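\[
\begin{aligned}
n(R - \delta_n) &\leq \log|\mathcal{M}^{(n)}| = H(M) \\
&= H(M) - H(X^n, S_e^n, M|Y^n, S_d^n) + H(M|Y^n, S_d^n) + H(S_e^n|M, Y^n, S_d^n) + H(X^n|M, S_e^n, Y^n, S_d^n) \\
&\overset{(*)}{\leq} H(M) - H(X^n, S_e^n, M|Y^n, S_d^n) + H(M|Y^n, S_d^n) + H(S_e^n|Y^n, S_d^n),
\end{aligned}
\]

where (*) holds since $f^{(n)}(\cdot)$ is a deterministic function, so that $H(X^n|M, S_e^n, Y^n, S_d^n) = 0$, and since conditioning reduces entropy.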

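Consider the last two terms in the above inequality. Similarly to [13], by Fano's inequality, we get

\[
\begin{aligned}
H(M|Y^n, S_d^n) &\leq h(\delta_n') + \delta_n' \cdot \log(2^{n(R - \delta_n')} - 1) = n\epsilon_n^{(m)}, \\
H(S_e^n|Y^n, S_d^n) &\leq h(\delta_n') + \delta_n' \cdot \log(|\mathcal{S}_e|^n - 1) = n\epsilon_n^{(s)},
\end{aligned}
\]

where $h(\cdot)$ is the binary entropy function, and $\epsilon_n^{(m)} \to 0$, $\epsilon_n^{(s)} \to 0$ as $n \to \infty$.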
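To see why the Fano terms vanish per channel use, the bound can be evaluated numerically. The sketch below is an illustration only: the function names are ours, and it uses the slightly looser estimate $\log(2^{nR} - 1) \leq nR$ (an assumption made here to avoid overflow for large $n$), so it upper-bounds the quantity $\epsilon_n^{(m)}$ rather than reproducing it exactly.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_per_symbol(delta, n, rate):
    """Upper bound on H(M | Y^n, S_d^n) / n for error probability delta
    and 2^(n * rate) messages, using log2(2^(n*rate) - 1) <= n * rate."""
    return (binary_entropy(delta) + delta * n * rate) / n

# The bound shrinks as the error probability goes to zero with growing n.
for n, delta in [(100, 0.1), (1000, 0.01), (10000, 0.001)]:
    print(n, delta, fano_per_symbol(delta, n, rate=1.0))
```

The dominant term is `delta * rate`, so the per-symbol penalty goes to zero exactly when the error probability does, which is what the converse needs.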

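Let $n\epsilon_n^{(m)} + n\epsilon_n^{(s)} = n\epsilon_n$, where $\epsilon_n$ satisfies $\lim_{n \to \infty} \epsilon_n = 0$. Now we continue the chain of inequalities and
get

\[
\begin{aligned}
n(R - \delta_n) &\leq H(M) - H(X^n, S_e^n, M|Y^n, S_d^n) + n\epsilon_n \\
&= H(M, S_e^n) - H(S_e^n|M) - H(X^n, S_e^n, M|Y^n, S_d^n) + n\epsilon_n \\
&\overset{(a)}{=} H(M, S_e^n, X^n) - H(S_e^n|M, A^n) - H(X^n, S_e^n, M|Y^n, S_d^n) + n\epsilon_n \\
&\overset{(b)}{=} H(M, S_e^n, X^n) - H(S_e^n|A^n) - H(X^n, S_e^n, M|Y^n, S_d^n) + n\epsilon_n \\
&= I(X^n, S_e^n, M; Y^n, S_d^n) - H(S_e^n|A^n) + n\epsilon_n,
\end{aligned}
\]

where (a) follows from the fact that $X^n = f^{(n)}(M, S_e^n)$ and $A^n = f_a^{(n)}(M)$, (b) follows since $S_e^n$ is independent
of $M$ given $A^n$.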

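Continuing the chain of inequalities, we get

\[
\begin{aligned}
n(R - \delta_n - \epsilon_n) &\leq I(X^n, S_e^n, M; Y^n, S_d^n) - H(S_e^n|A^n) \\
&\overset{(a)}{=} \sum_{i=1}^n \big[ I(X^n, S_e^n, M; Y_i, S_{d,i}|Y^{i-1}, S_d^{i-1}) - H(S_{e,i}|A_i) \big] \\
&\overset{(b)}{=} \sum_{i=1}^n \big[ I(X^n, S_e^n, M, A^n; Y_i, S_{d,i}|Y^{i-1}, S_d^{i-1}) - H(S_{e,i}|A_i) \big] \\
&\overset{(c)}{\leq} \sum_{i=1}^n \big[ H(Y_i, S_{d,i}) - H(Y_i, S_{d,i}|X_i, S_{e,i}, A_i) - H(S_{e,i}|A_i) \big] \\
&= \sum_{i=1}^n \big[ I(A_i, S_{e,i}, X_i; Y_i, S_{d,i}) - H(S_{e,i}|A_i) \big],
\end{aligned} \tag{47}
\]

where (a) follows from the memoryless property of the channel $P_{S_e|A}$, (b) follows from the fact that $A^n =
f_a^{(n)}(M)$, and (c) follows from the Markov chain $(Y^{i-1}, S_d^{i-1}, X^{n \setminus i}, S_e^{n \setminus i}, M) - (X_i, S_{e,i}, A_i) - (Y_i, S_{d,i})$ and that
conditioning reduces entropy.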

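Next we prove the constraint which does not involve the rate of the communication. From the standard properties
of the entropy function, we observe that

\[
\begin{aligned}
0 &\leq H(M|A^n) \\
&= H(M|A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) + H(M|Y^n, S_d^n, A^n) + H(S_e^n|M, Y^n, S_d^n, A^n) + H(X^n|M, S_e^n, Y^n, S_d^n, A^n) \\
&\overset{(*)}{\leq} H(M|A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) + H(M|Y^n, S_d^n) + H(S_e^n|Y^n, S_d^n),
\end{aligned}
\]

where (*) follows from the fact that $X^n = f^{(n)}(M, S_e^n)$, and $f^{(n)}(\cdot)$ is a deterministic function.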
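Again, applying Fano's inequality to the last two terms in the above inequality, we get

\[
\begin{aligned}
-n\epsilon_n &\leq H(M|A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) \\
&= H(M, S_e^n|A^n) - H(S_e^n|M, A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) \\
&\overset{(a)}{=} H(M, S_e^n, X^n|A^n) - H(S_e^n|M, A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) \\
&\overset{(b)}{=} H(M, S_e^n, X^n|A^n) - H(S_e^n|A^n) - H(X^n, S_e^n, M|Y^n, S_d^n, A^n) \\
&= I(X^n, S_e^n, M; Y^n, S_d^n|A^n) - H(S_e^n|A^n) \\
&\overset{(c)}{=} \sum_{i=1}^n \big[ I(X^n, S_e^n, M; Y_i, S_{d,i}|Y^{i-1}, S_d^{i-1}, A^n) - H(S_{e,i}|A_i) \big] \\
&\overset{(d)}{\leq} \sum_{i=1}^n \big[ H(Y_i, S_{d,i}|A_i) - H(Y_i, S_{d,i}|X_i, S_{e,i}, A_i) - H(S_{e,i}|A_i) \big] \\
&= \sum_{i=1}^n \big[ I(S_{e,i}, X_i; Y_i, S_{d,i}|A_i) - H(S_{e,i}|A_i) \big],
\end{aligned} \tag{48}
\]

where (a) follows from the fact that $X^n = f^{(n)}(M, S_e^n)$, (b) holds since $S_e^n$ is independent of $M$ given $A^n$, (c)
follows from the memoryless property of the channel $P_{S_e|A}$, and (d) follows from the Markov chain
$(Y^{i-1}, S_d^{i-1}, X^{n \setminus i}, S_e^{n \setminus i}, M) - (X_i, S_{e,i}, A_i) - (Y_i, S_{d,i})$ and that conditioning reduces entropy.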

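Let $Q$ be a random variable uniformly distributed over $\{1, \ldots, n\}$ and independent of $(M, A^n, X^n, S_e^n, S_d^n, Y^n)$;
we can rewrite (47) and (48) as

\[
\begin{aligned}
R &\leq \frac{1}{n}\sum_{i=1}^n \big[ I(A_i, S_{e,i}, X_i; Y_i, S_{d,i}|Q = i) - H(S_{e,i}|A_i, Q = i) \big] + \delta_n + \epsilon_n \\
&= I(A_Q, S_{e,Q}, X_Q; Y_Q, S_{d,Q}|Q) - H(S_{e,Q}|A_Q, Q) + \delta_n + \epsilon_n
\end{aligned}
\]

and

\[
\begin{aligned}
0 &\leq \frac{1}{n}\sum_{i=1}^n \big[ I(S_{e,i}, X_i; Y_i, S_{d,i}|A_i, Q = i) - H(S_{e,i}|A_i, Q = i) \big] + \epsilon_n \\
&= I(S_{e,Q}, X_Q; Y_Q, S_{d,Q}|A_Q, Q) - H(S_{e,Q}|A_Q, Q) + \epsilon_n.
\end{aligned}
\]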
Now since we have that $P_{S_{e,Q},S_{d,Q}|A_Q} = P_{S_e,S_d|A}$, $P_{Y_Q|X_Q,S_{e,Q},S_{d,Q}} = P_{Y|X,S_e,S_d}$, and $A_Q - (X_Q, S_{e,Q}, S_{d,Q}) -
Y_Q$ forms a Markov chain, we identify $A = A_Q$, $S_e = S_{e,Q}$, $S_d = S_{d,Q}$, $X = X_Q$, and $Y = Y_Q$ to finally obtain

\[
\begin{aligned}
R &\leq I(A, S_e, X; Y, S_d|Q) - H(S_e|A, Q) + \delta_n + \epsilon_n \\
\text{and} \quad 0 &\leq I(S_e, X; Y, S_d|A, Q) - H(S_e|A, Q) + \epsilon_n,
\end{aligned}
\]

for some joint distribution

\[
P_Q(q)\,P_{A|Q}(a|q)\,P_{S_e,S_d|A}(s_e, s_d|a)\,P_{X|A,S_e,Q}(x|a, s_e, q)\,P_{Y|X,S_e,S_d}(y|x, s_e, s_d). \tag{49}
\]

From the joint distribution in (49) and the derivation of (38)-(41) (see Lemma 3), we have the Markov chains
$Q - A - (S_e, S_d)$ and $Q - (X, A, S_e) - (Y, S_d)$.
To this end, we note that under any distribution of the form above, we have

\[
\begin{aligned}
I(A, S_e, X; Y, S_d|Q) - H(S_e|A, Q) &= H(Y, S_d|Q) - H(Y, S_d|A, X, S_e, Q) - H(S_e|A, Q) \\
&\leq H(Y, S_d) - H(Y, S_d|A, X, S_e, Q) - H(S_e|A, Q) \\
&\overset{(*)}{=} H(Y, S_d) - H(Y, S_d|A, X, S_e) - H(S_e|A) \\
&= I(A, S_e, X; Y, S_d) - H(S_e|A),
\end{aligned}
\]

and

\[
\begin{aligned}
I(S_e, X; Y, S_d|A, Q) - H(S_e|A, Q) &= H(Y, S_d|A, Q) - H(Y, S_d|A, X, S_e, Q) - H(S_e|A, Q) \\
&\leq H(Y, S_d|A) - H(Y, S_d|A, X, S_e, Q) - H(S_e|A, Q) \\
&\overset{(*)}{=} H(Y, S_d|A) - H(Y, S_d|A, X, S_e) - H(S_e|A) \\
&= I(S_e, X; Y, S_d|A) - H(S_e|A),
\end{aligned}
\]

where both equalities (*) follow from the Markov chains $(Y, S_d) - (X, A, S_e) - Q$ and $S_e - A - Q$, the inequalities hold since conditioning reduces entropy, and the
joint distribution of $(A, S_e, S_d, X, Y)$ is of the form

\[
\begin{aligned}
&\sum_{q \in \mathcal{Q}} P_Q(q)\,P_{A|Q}(a|q)\,P_{S_e,S_d|A}(s_e, s_d|a)\,P_{X|A,S_e,Q}(x|a, s_e, q)\,P_{Y|X,S_e,S_d}(y|x, s_e, s_d) \\
&\quad = P_A(a)\,P_{S_e,S_d|A}(s_e, s_d|a)\,P_{X|A,S_e}(x|a, s_e)\,P_{Y|X,S_e,S_d}(y|x, s_e, s_d).
\end{aligned}
\]

The proof is concluded by taking the limit $n \to \infty$. ■